Through the Internet and the World-Wide Web, a vast number of information sources
have become available, which offer information on various subjects by different providers,
often in heterogeneous formats. This calls for tools and methods for building an advanced
information-processing infrastructure. One issue in this area is the selection of suitable
information sources in query answering. In this paper, we present a knowledge-based
approach to this problem, in the setting where one among a set of information sources
(prototypically, data repositories) should be selected for evaluating a user query. We use
extended logic programs (ELPs) to represent rich descriptions of the information sources,
an underlying domain theory, and user queries in a formal query language (here, XML-QL,
but other languages can be handled as well). Moreover, we use ELPs for declarative query
analysis and generation of a query description. Central to our approach are declarative
source-selection programs, for which we define syntax and semantics. Due to the structured
nature of the considered data items, the semantics of such programs must carefully respect
implicit context information in source-selection rules, and furthermore combine it with
possible user preferences. A prototype implementation of our approach has been realized
exploiting the DLV KR system and its plp front-end for prioritized ELPs. We describe a
representative example involving specific movie databases, and report about experimental
results.

KEYWORDS: knowledge representation, nonmonotonic reasoning, logic programming,
answer-set programming, information-source selection, data repositories, preference
handling.



1 Introduction 

Through the Internet and the World-Wide Web (WWW), a wealth of information
has become available to a large group of users. A huge number of documents, files,
and data repositories on a range of subjects are offered by different providers, which
may be non-profit individuals, organizations, or companies. Such data repositories
are currently in heterogeneous formats, but the trend is for XML to become the
de facto standard for releasing data on the Web, since this eases data exchange.
Nonetheless, the quality of their contents may differ significantly with respect to
aspects such as accuracy, coverage of certain topics and completeness for them,
or refresh cycle, to mention just a few.

* Part of the material in this paper has appeared, in preliminary form, in the Proceedings of
the Eighth International Conference on Principles of Knowledge Representation and Reasoning
(KR '02), pp. 49-60, April 22-25, Toulouse, France, 2002.

Accessing and processing data on the Web calls for developing tools and methods
for an advanced information-processing infrastructure. Mediators (Wiederhold 1993)
and special information agents ("middle agents" (Decker et al. 1997)), which provide
various services including finding, selecting, and querying relevant information
sources, play an important role here. The potential of knowledge-based approaches,
and in particular of logic programming, for developing reasoning components for
intelligent information agents is recognized in the AI community and outlined, e.g.,
by Dimopoulos and Kakas (2001), Eiter et al. (2002b), and Sadri and Toni (2000).



In this paper, we pursue this issue further and present a declarative approach 
for information-source selection in the following setting. Given a query by a user 
in some formal query language and a suite of information sources over which this 
query might be evaluated, which of these sources is the best to answer the query, 
i.e., such that the utility of the answer, measured by the quality of the result and 
other criteria (e.g., costs), is as large as possible for the user? Note that this problem 
is in fact not bound to information sources on the Web but is of interest in any 
context where different candidate information sources (e.g., scientific databases, 
newspaper archives, stock exchange predictions, etc.) are available and one of them 
should be selected. Selection of a single source may be desired, for instance, because
of the (high) cost associated with accessing each source. Furthermore, problems arising
from integrating data from different sources (like inconsistencies between sources) can
be circumvented this way.

To illustrate our ideas, consider the following scenario.

Example 1 

Assume that some agent has access to XML information sources, s1, s2, and s3,
about movies. Furthermore, suppose that the following XML-QL¹ query is handed
to the agent, which informally asks a source for the titles of all movies directed by
Alfred Hitchcock:

    FUNCTION HitchcockMovies($MovieDB:"Movie.dtd") {
      CONSTRUCT <MovieList> {
        WHERE <MovieDB> <Movie>
                <Title> $t </Title>
                <Director> <Personalia>
                  <FirstName> "Alfred" </FirstName>
                  <LastName> "Hitchcock" </LastName>
                </Personalia> </Director>
              </Movie> </MovieDB>
        IN source($MovieDB)
        CONSTRUCT <Movie> $t </Movie>
      } </MovieList> }

1 For details about XML-QL, cf. Section 2.2.

Here, $t is a variable into which the value of attribute Title is selected, for usage
in the resulting construction. Suppose the agent knows that s1 is a very good source
for information about directors, while s2 usually has good coverage of person
data; all that is known about s3, however, is that it is not very reliable. In this
situation, we would expect that the agent selects s1 for querying.

Obviously, a sensible solution to this problem is nontrivial and involves various 
aspects such as taking basic properties of the information sources, knowledge about 
their contents, and knowledge about the particular application domain into account. 
These aspects have to be suitably combined, and reasoning may be needed to elicit 
implicit knowledge. We stress that the general problem considered here is distinct 
from a simple keyword-based search as realized by Web engines like Google,² and
consequently we do not propose a method for competing with these tools here.³ In
fact, we are concerned with qualitative selection from different alternatives, based 
on rich meta-knowledge and a formal semantics, thereby respecting preference and 
context information which involves heuristic defaults. 

Our approach, which incorporates the aspects mentioned above, makes several
contributions, which are briefly summarized as follows.

(1) We base our method on the answer-set programming paradigm, in which
problems are encoded in terms of nonmonotonic logic programs and solutions are
extracted from the models of these programs (cf. Baral (2003) for a comprehensive
treatise on answer-set programming). More precisely, we use extended logic programs
(ELPs) under the answer-set semantics (Gelfond and Lifschitz 1991), augmented
with priorities (cf., e.g., Brewka and Eiter (1999), Delgrande et al. (2003),
or Inoue and Sakama (2000) for work about priorities in answer-set programming)
and weak constraints (Buccafurri et al. 2000; Leone et al. 2006), to represent rich
descriptions of the information sources, an underlying domain theory, and queries in
a formal language. We perform query analysis by ELPs and compute query descriptions.
Here, we consider XML-QL (Deutsch et al. 1999), but our approach is not
committed to semi-structured data and XML per se, and other formal query languages
can be handled as well (e.g., Schindlauer (2002) adopts our query-analysis
method for the ubiquitous SQL language for relational databases).

(2) At the heart of our approach, a declarative source-selection program represents
both qualitative and quantitative criteria for source selection, in terms of rules and
soft constraints. The rules may access information supplied by other programs, including
object and value occurrences in the query. For example, a rule r1 may state that a query
about a person Alfred Hitchcock should be posed to source s1. Furthermore, ordinal
rule priorities can be employed in order to specify source-selection preference. For
example, a priority may state that a certain rule mentioning a last name in the
query is preferred over another rule mentioning the concept person only. Rules and
priorities are of qualitative nature and are taken into account for singling out a
coherent decision for the source selection in model-theoretic terms. Quantitative
criteria (like, e.g., cost) are used to discriminate between different such options
by means of an objective function which is optimized. To this end, conditions in
terms of conjunctions of literals can be stated whose violation is penalized (e.g.,
the selection of a certain source might be penalized but not strictly forbidden), and
total penalization is minimized. Such a two-step approach seems to be natural and
provides the user with a range of possibilities to express his or her knowledge and
selection desires in convenient form.

2 Google's homepage is found at http://www.google.com
3 Note that Google does not index XML files or databases underlying Web query interfaces, and
hence cannot be readily applied for the purposes considered here.

(3) We consider the interesting and, to the best of our knowledge, novel issue of
contexts in nonmonotonic logic programs, which is similar to preference based on
specificity (Delgrande and Schaub 1994; Geerts and Vermeir 1993; Geerts and Vermeir 1995).
Structured data items require a careful definition of the selection semantics, since an
attribute might be referenced following a path of indirections, starting from a root
object and passing through other objects. In Example 1, for instance, the attribute
FirstName is referenced with the path Movie/Director/Personalia/FirstName,
which starts at an object of type Movie and passes through objects of type Director
and Personalia. Each of these objects opens a context in which FirstName is referenced
along the remaining path. Intuitively, a context is less specific the closer we
are to the end of the path. Thus, for example, the reference from Personalia is less
specific than from Movie, and the latter should have higher priority. Note that such
priority is not based on inheritance (which is tailored for "flat" objects). Therefore,
inheritance-based approaches such as those by Laenens and Vermeir (1990) or
Buccafurri et al. (1996) do not apply here. Furthermore, implicit priorities derived
from context information as above must be combined with explicit user preferences
from the selection policy, and arising conflicts must be resolved.

(4) We have implemented a prototype, based on the KR system DLV (Leone et al. 2006)
and its front-end plp (Delgrande et al. 2001) for prioritized ELPs, which we used to
build a model application involving movie information sources. It comprises several 
XML databases, wrapped from movie databases on the Web, and handles queries 
in XML-QL. Experiments that we have conducted showed that the system behaved 
as expected on a number of natural queries, some of which require reasoning from 
the background knowledge to identify the proper selection. 

We use a knowledge-based approach, and in particular an answer-set programming
approach, for source selection rather than a standard decision-theoretic approach
based on utility functions because of the following advantages:

• Source-selection programs, which are special kinds of extended logic programs, 
are declarative and have a well-defined formal semantics, both under qualita- 
tive as well as under quantitative criteria. 

• The formalism is capable of handling incomplete information and performing
nonmonotonic inferences, which, arguably, is an inherent feature of the
problem domain under consideration.

• Changes in the specification of the source-selection process are easily incorpo- 
rated by modifying or adding suitable rules or constraints, without the need 
for re-designing the given program, as may be the case, e.g., in procedural 
languages. 

• Finally, the declarative nature of the answer-set semantics formalism also
permits a coupling with sophisticated ontology tools, as well as with reasoning
engines for them, providing advanced features for the domain knowledge. In
particular, the approach of Eiter et al. (2004; 2005b), providing a declarative
coupling of logic programs under the answer-set semantics with description-logic
knowledge bases, can be integrated into our framework.

We note that while we focus here on selecting a single source, our approach can
be easily extended to select multiple information sources, as well as to perform
ranked selections (as discussed later in the paper).

The rest of this paper is organized as follows. The next section contains the
necessary prerequisites from answer-set programming and XML-QL, and Section 3
gives a brief outline of our approach. In Section 4, we consider the generation of an
internal query representation, while Section 5 addresses the modeling of sources.
Section 6, then, is devoted to source-selection programs and includes a discussion of
some of their properties. The implementation and the movie application, as well as
experimental results, are the topics of Section 7. Section 8 addresses related work,
and Section 9 concludes the main part of the paper with a brief summary and open
research issues. Certain technical details and additional properties of our approach
are relegated to an appendix.

2 Preliminaries 

2.1 Answer-set programming 

We recall the basic concepts of answer-set programming. Let $\mathcal{L}$ be a function-free
first-order language. Throughout this paper, we denote variables by alphanumeric
strings starting with an upper-case letter, anonymous variables by '_', and constants
by alphanumeric strings starting with a lower-case letter or by a string in double
quotes.

An extended logic program (ELP) (Gelfond and Lifschitz 1991) is a finite set of
rules over $\mathcal{L}$ of form

$$L_0 \leftarrow L_1, \ldots, L_m, \mathit{not}\ L_{m+1}, \ldots, \mathit{not}\ L_n, \qquad (1)$$

where each $L_i$, $0 \le i \le n$, is a literal, i.e., an atom $A$ or a negated atom $\neg A$, and
"not" denotes negation as failure, or default negation. Intuitively, a rule of form (1)
states that we can conclude $L_0$ if (i) $L_1, \ldots, L_m$ are known and (ii) $L_{m+1}, \ldots, L_n$
are not known. For a rule $r$ as above, we call the literal $L_0$ the head of $r$ (denoted
$H(r)$) and the set $\{L_1, \ldots, L_m, \mathit{not}\ L_{m+1}, \ldots, \mathit{not}\ L_n\}$ the body of $r$ (denoted $B(r)$).
Furthermore, we define $B^+(r) = \{L_1, \ldots, L_m\}$ and $B^-(r) = \{L_{m+1}, \ldots, L_n\}$. If
$B(r) = \emptyset$, then $r$ is called a fact. We write $r(V_1, \ldots, V_n)$ to indicate that rule $r$ has
variables $V_1, \ldots, V_n$. To ease notation, for any program $\Pi$ and any set $S$ of literals,
$\Pi \cup S$ stands for the program $\Pi \cup \{L \leftarrow \mid L \in S\}$. Finally, for a literal $L$, we write
$\neg L$ to denote its complementary literal, i.e., $\neg L = A$ if $L = \neg A$, and $\neg L = \neg A$ if
$L = A$, for any atom $A$.

The semantics of an ELP $\Pi$ is given in terms of the semantics of its ground
instantiation, $\mathcal{G}(\Pi)$, over the Herbrand universe $U_{\mathcal{L}}$ of $\mathcal{L}$, which is the language
generated by $\Pi$. The program $\mathcal{G}(\Pi)$ contains all instances of rules from $\Pi$, i.e.,
where the variables are (uniformly) replaced with arbitrary terms from $U_{\mathcal{L}}$. Recall
that a literal, rule, program, etc., is ground iff it contains no variables. In what
follows, we assume that all such objects are ground.

An interpretation, $X$, is a consistent set of (ground) literals, i.e., $X$ does not
contain a complementary pair $A$, $\neg A$ of literals. A literal, $L$, is true in $X$ if $L \in X$,
and false otherwise. The body, $B(r)$, of a rule $r$ is true in $X$ iff (i) each $L \in B^+(r)$
is true in $X$ and (ii) each $L \in B^-(r)$ is false in $X$. Rule $r$ is true in $X$ iff $H(r)$ is
true in $X$ whenever $B(r)$ is true in $X$. Finally, a program, $\Pi$, is true in $X$, or $X$ is
a model of $\Pi$, iff all rules in $\Pi$ are true in $X$. We write $X \models \alpha$ to indicate that an
object $\alpha$, which may be either a literal, the body of a rule, a rule, or a program, is
true in $X$.

Let $X$ be a set of literals and $\Pi$ a program. The Gelfond-Lifschitz reduct, or
simply reduct, $\Pi^X$, of $\Pi$ relative to $X$ is given by

$$\Pi^X = \{ H(r) \leftarrow B^+(r) \mid r \in \Pi \text{ and } B^-(r) \cap X = \emptyset \}.$$

We call $X$ an answer set of $\Pi$ iff $X$ is a minimal model of $\Pi^X$ with respect to set
inclusion. Observe that any answer set of $\Pi$ is a fortiori a model of $\Pi$. The set of
all generating rules of an answer set $X$ with respect to $\Pi$ is given by

$$GR(X, \Pi) = \{ r \in \Pi \mid X \models B(r) \}.$$

Example 2 

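Let $\Pi = \{ s \leftarrow \mathit{not}\ t;\ n \leftarrow;\ t \leftarrow n, \mathit{not}\ s;\ w \leftarrow t \}$. For the interpretation
$X_1 = \{n, t, w\}$, we have $\Pi^{X_1} = \{ n \leftarrow;\ t \leftarrow n;\ w \leftarrow t \}$. Clearly, $X_1$ is a minimal model of
$\Pi^{X_1}$, and thus $X_1$ is an answer set of $\Pi$. Note that $X_2 = \{n, s\}$ is another answer
set of $\Pi$.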
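
The reduct construction and the answer-set test of Example 2 can be replayed mechanically. The following Python sketch is only an illustration under simplifying assumptions: rules are ground, literals are plain atoms (no classical negation), and answer sets are found by brute-force enumeration rather than by an ASP solver. The names `PROGRAM`, `reduct`, `least_model`, `answer_sets`, and `generating_rules` are introduced here for illustration and do not come from the paper or from DLV.

```python
from itertools import chain, combinations

# A ground rule is (head, positive_body, negative_body); atoms are strings.
# The program below is the one from Example 2.
PROGRAM = [
    ("s", frozenset(), frozenset({"t"})),       # s <- not t
    ("n", frozenset(), frozenset()),            # n <-
    ("t", frozenset({"n"}), frozenset({"s"})),  # t <- n, not s
    ("w", frozenset({"t"}), frozenset()),       # w <- t
]


def reduct(program, x):
    """Gelfond-Lifschitz reduct: drop rules blocked by x, strip 'not' literals."""
    return [(head, pos) for head, pos, neg in program if not (neg & x)]


def least_model(positive_program):
    """Least model of a negation-free program, computed by fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in positive_program:
            if pos <= model and head not in model:
                model.add(head)
                changed = True
    return frozenset(model)


def answer_sets(program):
    """Enumerate every set of atoms and keep those equal to their reduct's model."""
    atoms = sorted({h for h, _, _ in program}
                   | {a for _, pos, neg in program for a in pos | neg})
    subsets = chain.from_iterable(
        combinations(atoms, k) for k in range(len(atoms) + 1))
    return [frozenset(c) for c in subsets
            if least_model(reduct(program, frozenset(c))) == frozenset(c)]


def generating_rules(program, x):
    """GR(X, Pi): the rules whose body is true in x."""
    return [r for r in program if r[1] <= x and not (r[2] & x)]
```

Calling `answer_sets(PROGRAM)` yields the two sets discussed in Example 2; the exponential enumeration is acceptable only for such toy programs.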

A (possibly non-ground) program $\Pi$ is locally stratified (Przymusinski 1988) iff
there exists a mapping $\lambda$ assigning each literal occurring in $\mathcal{G}(\Pi)$ a natural number
such that, for each rule $r \in \mathcal{G}(\Pi)$, it holds that (i) $\lambda(H(r)) \ge \max_{L \in B^+(r)} \lambda(L)$ and
(ii) $\lambda(H(r)) > \max_{L \in B^-(r)} \lambda(L)$. Note that $\Pi$ is (globally) stratified (Apt et al. 1988)
if, additionally, for all positive (resp., negative) literals $L$ and $L'$ with the same predicate,
$\lambda(L) = \lambda(L')$ holds. It is well-known that if a program is locally stratified,
then it has at most one answer set.

A refinement of the answer-set semantics is the admission of preferences among
the rules of a given ELP, yielding the class of prioritized logic programs. Several
approaches in this respect have been introduced in the literature, like, e.g., those
by Brewka and Eiter (1999) or Inoue and Sakama (2000); here, we use a preference
approach based on a method due to Delgrande et al. (2003), defined as follows.

Let $\Pi$ be an ELP and $<$ a strict partial order between the elements of $\Pi$ (i.e., $<$
is an irreflexive and transitive relation). Informally, for rules $r_1, r_2 \in \Pi$, the relation
$r_1 < r_2$ expresses that $r_2$ has preference over $r_1$. Define the relation $<_{\mathcal{G}}$ over the
ground instantiation $\mathcal{G}(\Pi)$ of $\Pi$ by setting $\hat{r}_1 <_{\mathcal{G}} \hat{r}_2$ iff $r_1 < r_2$, for $\hat{r}_1, \hat{r}_2 \in \mathcal{G}(\Pi)$,
where $\hat{r}_i$ denotes a ground instance of $r_i$.
Then, the pair $(\Pi, <)$ is called a prioritized extended logic program, or simply a
prioritized logic program, if the relation $<_{\mathcal{G}}$ is a strict partial order.

The semantics of prioritized programs is as follows. Let $(\Pi, <)$ be a prioritized
logic program where $\Pi$ is ground, and let $X$ be an answer set of $\Pi$. We call $X$ a
preferred answer set of $(\Pi, <)$ iff there exists an enumeration $\langle r_i \rangle_{i \in I}$ of $GR(X, \Pi)$
such that, for every $i, j \in I$, we have that:

(P1) $B^+(r_i) \subseteq \{ H(r_k) \mid k < i \}$;
(P2) if $r_i < r_j$, then $j < i$; and
(P3) if $r_i < r'$ and $r' \in \Pi \setminus GR(X, \Pi)$, then $B^+(r') \not\subseteq X$ or $B^-(r') \cap \{ H(r_k) \mid k < i \} \neq \emptyset$.

Conditions (P1)-(P3) realize a strongly "prescriptive" interpretation of preference,
in the sense that, whenever $r_1 < r_2$ holds, it is ensured that $r_2$ is known to be
applied or blocked ahead of $r_1$ (with respect to the order of rule application). More
specifically, (P2) guarantees that all generating rules are applied according to the
given order, whilst (P3) assures that any preferred yet inapplicable rule is either
blocked due to the non-derivability of its prerequisites or because it is defeated by
higher-ranked or unrelated rules. As shown by Delgrande et al. (2003), the selection
of preferred answer sets can be encoded by means of a suitable translation from
prioritized logic programs into standard ELPs.

Preferred answer sets of a prioritized program $(\Pi, <)$ where $\Pi$ is non-ground are
given by the preferred answer sets of the prioritized program $(\mathcal{G}(\Pi), <_{\mathcal{G}})$, where $<_{\mathcal{G}}$
is as above. Note that the concept of prioritization realizes a filtering of the answer
sets of a given program $\Pi$, as every preferred answer set of $(\Pi, <)$ is an answer set
of $\Pi$, but not vice versa.
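
To make the filtering effect of conditions (P1)-(P3) concrete, the following Python sketch tests them by trying every enumeration of the generating rules. It is a naive illustration, not the translation of Delgrande et al. (2003) and not how plp works. Several things are assumptions made here: the rule names r1 to r4 for the program of Example 2, the single preference r1 < r3 (which does not appear in the paper), and the reading of (P3) as "the positive body of r' is not contained in X, or the negative body of r' meets the heads of rules enumerated before r_i".

```python
from itertools import permutations

# Rules of Example 2 as name -> (head, positive body, negative body).
RULES = {
    "r1": ("s", frozenset(), frozenset({"t"})),       # s <- not t
    "r2": ("n", frozenset(), frozenset()),            # n <-
    "r3": ("t", frozenset({"n"}), frozenset({"s"})),  # t <- n, not s
    "r4": ("w", frozenset({"t"}), frozenset()),       # w <- t
}
# Hypothetical preference, not from the paper: r1 < r3 (r3 is preferred).
PREFERS = {("r1", "r3")}


def generating_rules(x):
    """Names of the rules whose body is true in x, i.e. GR(X, Pi)."""
    return [name for name, (_, pos, neg) in RULES.items()
            if pos <= x and not (neg & x)]


def is_preferred(x, less=PREFERS):
    """Check (P1)-(P3) for an answer set x; 'less' must be a strict partial order."""
    gr = generating_rules(x)
    outside = [name for name in RULES if name not in gr]
    for order in permutations(gr):
        # heads_before[i] collects H(r_k) for all k < i in this enumeration.
        heads_before, seen = [], set()
        for name in order:
            heads_before.append(frozenset(seen))
            seen.add(RULES[name][0])
        p1 = all(RULES[r][1] <= heads_before[i] for i, r in enumerate(order))
        p2 = all(j < i
                 for i, ri in enumerate(order)
                 for j, rj in enumerate(order)
                 if (ri, rj) in less)
        p3 = all((not (RULES[rp][1] <= x)) or bool(RULES[rp][2] & heads_before[i])
                 for i, ri in enumerate(order)
                 for rp in outside
                 if (ri, rp) in less)
        if p1 and p2 and p3:
            return True
    return False
```

Under the assumed preference, only the answer set {n, t, w} passes the test: in {n, s}, the preferred rule r3 is neither applied nor blocked before r1 fires. With an empty preference relation, both answer sets pass.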

Besides imposing qualitative selection criteria, like assigning preferences between
different rules, another refinement of the answer-set semantics is given by weak constraints
(Buccafurri et al. 2000; Leone et al. 2006), representing a quantitative filtering of
answer sets. Formally, a weak constraint is an expression of form

$$\Leftarrow L_1, \ldots, L_m, \mathit{not}\ L_{m+1}, \ldots, \mathit{not}\ L_n\ [w : l], \qquad (2)$$

where each $L_i$, $1 \le i \le n$, is a literal (not necessarily ground) and $w, l \ge 1$ are
natural numbers.⁴ The number $w$ is the weight and $l$ is the priority level of the weak
constraint (2). Given an interpretation $X$, the weight of a ground weak constraint
$c$ of the above form with respect to a level $l' \ge 1$, $\mathit{weight}_{c,l'}(X)$, is $w$, if $X \models L_i$,
$1 \le i \le m$, $X \not\models L_{m+j}$, $1 \le j \le n-m$, and $l' = l$, and $0$ otherwise; the weight of
a non-ground weak constraint $c$ with respect to a level $l$, $\mathit{weight}_{c,l}(X)$, is given
by $\sum_{c' \in \mathcal{G}(c)} \mathit{weight}_{c',l}(X)$, where $\mathcal{G}(c)$ denotes the set of all ground instances of $c$.
Weak constraints then select those answer sets $X$ of the weak-constraint-free part
of a program $\Pi$ for which the associated vector

$$\mathit{weights}(X) = (\mathit{weight}_{l_{max}}(X), \mathit{weight}_{l_{max}-1}(X), \ldots, \mathit{weight}_1(X))$$

is lexicographically smallest, where $l_{max}$ is the highest priority level occurring and
$\mathit{weight}_l(X) = \sum_{c \in wc(\Pi)} \mathit{weight}_{c,l}(X)$, for each $l$, with $wc(\Pi)$ denoting the set of
all weak constraints occurring in $\Pi$. Informally, first those answer sets are pruned
for which the weight of violated constraints is not minimal at the highest priority
level; from the remaining answer sets, those are pruned where the sum of weights of
violated constraints in the next lower level is not minimal, and so on. For example,
if we add in Example 2 the weak constraints $c_1$: $\Leftarrow n, \mathit{not}\ w\ [3:1]$ and
$c_2$: $\Leftarrow t, w\ [1:2]$, then we have $\mathit{weights}(X_1) = (1,0)$ and $\mathit{weights}(X_2) = (0,3)$; hence, the
answer set $X_1$ is discarded.

4 The part "[w : l]" is convenient syntactic sugar for the original definition by
Buccafurri et al. (2000), which merely provided a partitioning of the weak constraints in priority
levels.
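
The weight vectors of the example can be recomputed with a few lines of Python. This is a sketch for ground weak constraints over plain atoms only; the tuple encoding and the names `WEAK_CONSTRAINTS`, `weights`, and `optimal` are chosen here for illustration and are not DLV syntax.

```python
# Weak constraints of the running example, encoded as
# (positive body, negative body, weight, level).
WEAK_CONSTRAINTS = [
    (frozenset({"n"}), frozenset({"w"}), 3, 1),  # c1: <= n, not w [3:1]
    (frozenset({"t", "w"}), frozenset(), 1, 2),  # c2: <= t, w     [1:2]
]


def weights(x, constraints=WEAK_CONSTRAINTS):
    """Vector (weight_lmax(X), ..., weight_1(X)) of violated weights per level."""
    l_max = max(level for _, _, _, level in constraints)
    totals = [0] * l_max
    for pos, neg, weight, level in constraints:
        if pos <= x and not (neg & x):       # body true in x: constraint violated
            totals[l_max - level] += weight  # highest level is stored first
    return tuple(totals)


def optimal(candidates, constraints=WEAK_CONSTRAINTS):
    """Candidates whose weight vector is lexicographically smallest."""
    best = min(weights(x, constraints) for x in candidates)
    return [x for x in candidates if weights(x, constraints) == best]


X1 = frozenset({"n", "t", "w"})
X2 = frozenset({"n", "s"})
```

Python compares tuples lexicographically, so `min` over the vectors implements the level-by-level pruning described above without further code.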

The numeric lexicographic preference can be reduced by usual techniques to an
objective function $H^{\Pi}(X)$, which assigns each answer set $X$ an integer such that
those answer sets $X$ for which $H^{\Pi}(X)$ is minimal are precisely those for which
$\mathit{weights}(X)$ is lexicographically smallest. In the above example, $X_2$ is selected as
the "optimal" answer set. While the availability of both weights and levels is syntactic
sugar, they are very useful for expressing preferences in a more natural and
convenient form. In the example above, $c_2$, being at level 2, dominates $c_1$, which is
at level 1. Weights within the same layer can be used for fine-tuning. For formal
details and more discussion, we refer the reader to Leone et al. (2006).
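
One common way to fold a weight vector into a single integer is a positional encoding with a sufficiently large base. The sketch below shows this idea only; it is not claimed to coincide with the objective function defined by Leone et al. (2006), and the function name `objective` and the parameter `base` are assumptions introduced here.

```python
def objective(weight_vector, base):
    """Fold (weight_lmax, ..., weight_1) into one integer, highest level first.

    'base' must be larger than any per-level weight sum that can occur, so that
    a difference at a higher level always outweighs all lower levels together.
    """
    value = 0
    for w in weight_vector:
        if not 0 <= w < base:
            raise ValueError("base must exceed every per-level weight sum")
        value = value * base + w
    return value
```

For the example above, the weights of all constraints sum to 4, so a base of 5 suffices: the vector (1, 0) is mapped to 5 and (0, 3) to 3, and minimizing the integer selects the same answer set as the lexicographic comparison.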



2.2 XML-QL 

We next introduce basic concepts of XML-QL (Deutsch et al. 1999), a query language
for data stored in the Extensible Markup Language (XML). We assume that
the reader is familiar with XML, which has emerged as a standard for providing
(semi-structured) data on the Web. While XML is syntactically similar to the Hypertext
Markup Language (HTML), features have been added to it for data-representation
purposes, such as user-defined tags and nested elements. Unlike relational
or object-oriented data, XML is semi-structured, i.e., it can have irregular
(and extensible) structure, and attributes (or schemas) are stored with the data. The
structure of an XML document can be optionally modeled and validated against a
Document Type Descriptor (DTD). In this paper, we take a database-oriented view
of XML documents, considering them as databases and a corresponding DTD as
their database schema. For a comprehensive introduction to semistructured data and
database aspects about them, we refer to Abiteboul et al. (2000).



XML-QL is a declarative, relationally complete query language for XML data,
which can not only query XML data, but also construct new XML documents from
query answers, i.e., it can also be used to restructure XML data. Its syntax deviates
from the well-known "select-from-where" syntax of the Structured Query Language
(SQL), but can be decomposed into three syntactical units as well:

1. a where part (keyword WHERE), specifying a selection condition by element 
reference and comparison predicates; 

2. a source part (keyword IN), declaring a data source for the query (an external 
file, or an internal variable); and 

3. a construct part (keyword CONSTRUCT), defining a structure for the resulting 
document. 

In the latter part, subqueries can be built by nesting. 

XML-QL uses element patterns to match data in an XML document. Elements 
are referenced by their names and are traversed according to the XML source 
structure. Thus, reference paths can be identified with every matching. Variables 
are in general not bound to elements but to element contents (but syntactic sugar 
exists for element binding). Furthermore, elements can be joined by values using 
the same variable in two matchings, i.e., theta-joins can be expressed. 

Let us briefly illustrate the most basic concepts in the following example; for
further details, we refer to Deutsch et al. (1999) and Abiteboul et al. (2000).



Example 3 

Throughout the paper, we consider XML-QL queries stored as XML-QL functions,
which serves two purposes. First, it allows us to efficiently query several XML
documents by dynamic bindings of data sources, and second, we can additionally
specify that a data source has to obey a certain DTD. The following query is
represented as an XML-QL function. Upon its invocation, the variable $MovieDB
is instantiated with the name of an XML document to be queried, which has to be
structured according to the DTD detailed in Appendix A.

    FUNCTION ExampleQuery($MovieDB:"Movie.dtd") {
      WHERE <MovieDB> <Movie> $m1 </Movie> </MovieDB> IN source($MovieDB),
            <Actor> $a </Actor> IN $m1,
            <MovieDB> <Movie> $m2 </Movie> </MovieDB> IN source($MovieDB),
            <Actor> $a </Actor> IN $m2,
            $m1 != $m2
      CONSTRUCT <x2Actor> $a </x2Actor>
    }

In the where part of the above query, variables $m1 and $m2 are bound to different
matchings under the reference path MovieDB/Movie. Furthermore, the two matchings
are joined by the (common) value of variable $a, referenced under element
Actor. The construct part of the query creates a new XML document by listing
the values of $a marked up with tags <x2Actor>. Intuitively, the query returns all
actors in a given XML document about movies who act in at least two
movies.
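
For readers without an XML-QL engine at hand, the effect of the query can be approximated in Python with the standard `xml.etree.ElementTree` module. The sketch below rests on assumptions: the sample document is invented and much simpler than the DTD of Appendix A, the condition $m1 != $m2 is approximated by requiring two distinct Movie elements, and each qualifying actor is returned once rather than once per variable binding.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real Movie.dtd is richer than assumed here.
SAMPLE = """
<MovieDB>
  <Movie><Title>Notorious</Title>
    <Actor>Cary Grant</Actor><Actor>Ingrid Bergman</Actor></Movie>
  <Movie><Title>North by Northwest</Title>
    <Actor>Cary Grant</Actor><Actor>Eva Marie Saint</Actor></Movie>
  <Movie><Title>Vertigo</Title>
    <Actor>James Stewart</Actor></Movie>
</MovieDB>
"""


def actors_in_two_movies(xml_text):
    """Emulate ExampleQuery: actors occurring in at least two Movie elements."""
    root = ET.fromstring(xml_text.strip())  # the root element is <MovieDB>
    movies = root.findall("Movie")          # matchings under MovieDB/Movie
    found = set()
    for i, m1 in enumerate(movies):
        actors1 = {a.text.strip() for a in m1.findall("Actor")}
        for j, m2 in enumerate(movies):
            if i == j:                      # stands in for $m1 != $m2
                continue
            actors2 = {a.text.strip() for a in m2.findall("Actor")}
            found |= actors1 & actors2      # join on the common value of $a
    # The construct part wraps every value of $a in <x2Actor> tags.
    return [f"<x2Actor>{name}</x2Actor>" for name in sorted(found)]
```

On the sample document, only Cary Grant appears in two movies, so the function returns a single <x2Actor> element.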
Fig. 1. Using a selection base $S = (\Pi_{qa}, \Pi_{sd}, \Pi_{dom}, \Pi_{sel}, <_u)$ for source selection
for a query $Q$.



3 Overview of the approach 

Before presenting the technical details of our approach, it is helpful to give a short
overview. While the motivating example in Section 1 is simple, it shows that the
source-selection process involves different kinds of knowledge, including

• knowledge about which "interesting" information should be extracted from a 
given formal query expression Q, 

• knowledge about the information sources and their properties, 

• background knowledge about the application domain and the ontology used 
for its formalization, and 

• specific rules which guide the source selection, based on preferences or generic 
principles. 

In our approach, this is formalized in terms of the notion of a selection base

$$S = (\Pi_{qa}, \Pi_{sd}, \Pi_{dom}, \Pi_{sel}, <_u),$$

where $\Pi_{qa}$, $\Pi_{sd}$, and $\Pi_{dom}$ are ELPs, called query-analysis program, source description,
and domain theory, respectively, and $(\Pi_{sel}, <_u)$ is a prioritized logic program with
a special syntax, called source-selection program. Given a selection base $S$ as above,
the possible solutions of a query $Q$ relative to $S$ are determined by the selection answer
sets of the source-selection program $(\Pi_{sel}, <_u)$, which are defined as preferred
answer sets of a prioritized logic program $\mathcal{E}(S, Q) = (\Pi_Q, <)$, associated with $S$
and $Q$, as shown in Figure 1.

The components of a selection base serve the following purposes: 

Query-analysis program $\Pi_{qa}$: For any query $Q$ as in Example 1, a high-level
description is extracted from a low-level (syntactic) representation, $R(Q)$, given
as a set of elementary facts, by applying $\Pi_{qa}$ to $R(Q)$ and ontological knowledge,
$Ont$, about concepts (types) and synonyms from the domain theory $\Pi_{dom}$, in
terms of facts for predicates class(O) and synonym(C1, C2). Informally, the rules
of $\Pi_{qa}$ single out the essential parts of $Q$, such as occurrence of attributes and
values in the query, comparison and joins, or subreference paths of attributes from
objects on a reference path. For instance, in Example 1, the attribute FirstName
from an object of type Director is referenced via path Personalia/FirstName on
the reference path Movie/Director/Personalia/FirstName from the root.

Source description $\Pi_{sd}$: This program contains information about the available
sources, using special predicates for query topics, cost aspects, and technical
aspects.

Domain theory $\Pi_{dom}$: The agent's knowledge about the specific application domain
(like, e.g., the movie area) is represented in the domain theory $\Pi_{dom}$. It
includes ontological knowledge and further background knowledge, permitting
(modest) common-sense reasoning. The ontology is assumed to have concepts
(classes), attributes, and instance and subconcept information, which are provided
via class(O), class_att(C, A), instance(O, C), and is_a(C1, C2) predicates,
respectively. Furthermore, it is assumed to have information about concept synonyms,
provided via predicate synonym(C1, C2). The ontology may be partly established
using meta-information about the data in the information sources (e.g.,
an XML DTD), and with ontology rules. Since ontological reasoning is orthogonal
to our approach, we do not consider it here and refer to Eiter et al. (2003)
for a further elaboration.

Source-selection program $(\Pi_{sel}, <_u)$: The information source selection is specified
by rules and constraints, which refer to predicates defined in the above programs.
It comprises both qualitative aspects and quantitative aspects in terms
of optimization criteria (concerning, e.g., cost or response time), which are expressed
using weak constraints (Buccafurri et al. 2000). Furthermore, the user
can define preferences between rules, in terms of a strict partial order, $<_u$. These
preferences are combined with implicit priorities that emerge from the context in
which source selection rules should be applied, and possible preference conflicts
are resolved.

Given a query $Q$, the overall evaluation relative to $S$, then, proceeds in three
steps:

Step 1 (query description): The input query $Q$ is parsed and mapped into the
internal query representation, $R(Q)$, which is extended using $\Pi_{qa}$ and $Ont$ to the
full query description.

Step 2 (qualitative selection): From $R(Q)$, $\Pi_{qa}$, $\Pi_{sd}$, and $\Pi_{dom}$, the qualitative
part of $\Pi_{sel}$ is used to single out different query options by respecting qualitative
aspects only, where explicit preferences, $<_u$, and implicit priorities must be taken
into account. To this end, a priority relation $<$ is computed on rules, which is then
used in a prioritized logic program $(\Pi_Q, <)$. Candidate solutions are computed
as preferred answer sets of $(\Pi_Q, <)$.

Step 3 (optimization): Among the candidates of Step 2, the one is chosen which
is best under the quantitative aspects of $\Pi_{sel}$, and the selected source is output.

4 Query description 

An integral feature of our approach is a meaningful description of a given formal 
query expression Q in an internal format. For our purposes, we need a suitable 
representation of the constituents of Q in terms of predicates and objects. Simply 
mapping Q (which is represented as a string) to logical facts which encode its 
syntax tree (i.e., the external format) does not serve our purposes. Rather, we need 
a meta-level description which provides "interesting" information about Q, such
as the occurrence of an attribute or a value in Q, related to the scope of
appearance.

For example, in the query of the running example, the value "Hitchcock" occurs
in a selection on the attribute LastName reached by the reference path
Personalia/LastName from Director. In the internal query representation, this
selection will be represented by the fact selects(o3, equal, "Hitchcock"),
where o3 is an internal name for the full reference path
"Movie/Director/Personalia/LastName" (i.e., the reference path from the root),
and by a fact cref(o3, "Director", "Personalia/LastName", q1), where q1 is an
internal identifier for the query. Also, a fact occurs(o3, "Hitchcock") will be
present that less specifically states that "Hitchcock" is associated with this
reference path.

The general format of these three predicates, which play a vital part in our 
architecture, is as follows: 

• cref(O, C, P, Q) states that within the full reference path O in the syntax
tree for query Q, the path from C to the leaf is P;

• occurs(O, V) states that the value V is associated with the full reference
path O in the overall query; and

• selects(O, R, V) is similar to occurs, but details the association with a
comparison operator R.
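To make the three predicates concrete, the facts for the "Hitchcock" selection discussed above can be written down as plain tuples. This is only a minimal sketch in Python, not the parser of the prototype: the tuple encoding, the alias `Fact`, and the helper `facts_for_selection` are hypothetical names introduced for illustration.

```python
from typing import List, Tuple

# Hypothetical encoding: (predicate name, argument 1, argument 2, ...)
Fact = Tuple[str, ...]

def facts_for_selection(oid: str, qid: str, concept: str, path: str,
                        op: str, value: str) -> List[Fact]:
    """Build the cref, selects, and occurs facts for one selection in a query."""
    return [
        ("cref", oid, concept, path, qid),  # path from `concept` to the leaf
        ("selects", oid, op, value),        # value compared via operator `op`
        ("occurs", oid, value),             # value merely associated with the path
    ]

facts = facts_for_selection("o3", "q1", "Director",
                            "Personalia/LastName", "equal", "Hitchcock")
```

Keeping occurs separate from selects mirrors the text: a rule that only cares whether a value shows up on a path does not need to know the comparison operator.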

In accordance with the syntactical units of XML-QL, our query-analysis method
adopts the general view in which a query expression consists of a where part, a
source part, and a construct part. For the description of Q, we employ facts on
designated predicates, which are independent of a fixed query language. These
facts are divided into two groups, which we refer to as parser facts and
derived facts, respectively.

4.1 Parser facts

The first group of facts, denoted R(Q), is generated by a query parser, and is
regarded as a "low-level" part of the query representation. The query parser
scans the query string Q to extract "interesting" information, and assembles
structural information (such as about subqueries, and in which of them a
reference to a certain attribute is made). The main purpose of R(Q) is to
filter and reduce the information which is present in the syntax tree of Q, and
to assemble it into suitable facts. For that, the parser must introduce
identifiers (names) for queries, subqueries, and other query constituents, in
particular references to items (i.e., attributes or concepts), which in a query
are selected or compared to values or other items. Every item reference is
given by a maximal reference path in Q, which we call an item reference path
(IRP). The parser names each occurrence of an IRP with a unique constant (note
that the same IRP may have multiple occurrences in Q).



Note that Eiter et al. (2003) and Fink (2002) name the predicate cref access,
and call reference paths access paths.
Example 4

In the running example, the query is named q1. It has a subquery in the
construct <MovieList> part, identified by q2. There are three IRPs, namely

"Movie/Title",

"Movie/Director/Personalia/FirstName", and

"Movie/Director/Personalia/LastName".

Their identifiers are o1, o2, and o3, respectively.

4.2 Derived facts

The second group of facts are those which are derived from R(Q) by means of a
further analysis. Compared to R(Q), these facts can be regarded as a "high-level"
description of the query. In particular, for the attribute or concept at the end of 
an IRP, the contexts of reference are determined, which are the suffixes of the 
IRP starting at some concept (as known from the underlying ontology) . Intuitively, 
instances of this concept have the referenced item as a (nested) attribute. Detaching 
the leading concept from the suffix results in the notion of a context-reference pair, 
defined as follows: 

Definition 1 

A pair (C, P), where C is a concept from the ontology and P is a path, is a
context-reference pair (CRP) of a query Q if Q contains an IRP with suffix "C/P".

Example 5 

Continuing Example 4, assume that the concepts MovieDB, Movie, Director, and
Person are in the ontology, and it is known that "Personalia" is a synonym of
"Person" in the ontology. Then, from the IRP o1 = "Movie/Title", the CRPs

("MovieDB", "Movie/Title") and ("Movie", "Title")

are determined, and from the IRP o2 = "Movie/Director/Personalia/FirstName",
the CRPs

("MovieDB", "Movie/Director/Personalia/FirstName"),
("Movie", "Director/Personalia/FirstName"),
("Director", "Personalia/FirstName"), and
("Personalia", "FirstName")

are obtained.
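The CRPs of Example 5 can be reproduced mechanically by cutting a reference path at every step that names a known concept. The following Python sketch is an illustration only, not the query-analysis program of the paper. It makes two assumptions that are labeled in the code: synonyms are handled by simply adding "Personalia" to the concept set, and the IRP is prefixed with the root concept "MovieDB" so that the pair with context "MovieDB" is found. The function name `context_reference_pairs` is hypothetical.

```python
from typing import List, Set, Tuple

def context_reference_pairs(full_path: str,
                            concepts: Set[str]) -> List[Tuple[str, str]]:
    """Return all (C, P) such that the path has a suffix "C/P" with C a concept."""
    steps = full_path.split("/")
    pairs = []
    # The last step is the referenced item itself, so it cannot be the context.
    for i, step in enumerate(steps[:-1]):
        if step in concepts:
            pairs.append((step, "/".join(steps[i + 1:])))
    return pairs

# Assumption: synonym handling is reduced to listing "Personalia" as a concept.
CONCEPTS = {"MovieDB", "Movie", "Director", "Person", "Personalia"}

# Assumption: each IRP is prefixed with the root concept "MovieDB".
crps_o1 = context_reference_pairs("MovieDB/Movie/Title", CONCEPTS)
crps_o2 = context_reference_pairs(
    "MovieDB/Movie/Director/Personalia/FirstName", CONCEPTS)
```

In the paper this derivation is done declaratively with the ontology facts and the subpath predicate; the procedural loop above is only meant to show which suffixes qualify.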

The high-level description facts are computed declaratively by evaluating a
query-analysis logic program, $\Pi_{qa}$, to which the facts R(Q) and further
facts Ont, which provide ontological knowledge about concepts and synonyms from
the domain theory, are added as "input". Furthermore, the program enriches the
low-level predicate subpath with synonym information and closes subpath
transitively. In summary, the query description is given by the (unique) answer
set of the logic program $Ont \cup \Pi_{qa} \cup R(Q)$.

A detailed list of all query-description predicates, as well as the complete
query-analysis program, can be found in Appendix B.

5 Source description

Besides query information and domain knowledge, the source-selection process
requires a suitable description of the information sources to select from. This
is provided by means of meta-knowledge represented in the source-description
part of the knowledge base, given in the form of a (simple) logic program,
$\Pi_{sd}$, which is assumed to have a unique answer set. Different predicates
can be used for this purpose, depending on the specific application. In the
following, we introduce, in an exemplary fashion, a basic suite of predefined
source-description predicates, which cover several aspects of an information
source:

(i) Thematic aspects: 

• accurate(S, T, V): source S, topic T, value V;

• covers(S, T, V): source S, topic T, value V;

• specialized(S, T): source S, topic T;

• relevant(S, T): source S, topic T.

The first two predicates express the accuracy and coverage of a source for a
topic, using values from {low, med, high}. The others are for stating that a
source is specialized or relevant for a particular topic, respectively.
(ii) Cost aspects: 

• avg_download_time(S, V): source S, value V;

• avg_down_time(S, V): source S, value V;

• charge(S, V): source S, value V.

Costs for accessing an information source can be expressed by these predicates,
again using values low, med, high, and, for charge, also no. While charge is
used for direct costs, avg_download_time and avg_down_time are indirect costs
(taking network traffic into account).
(iii) Technical aspects: 

• source_type(S, T1, T2): source S, organizational type T1, query type T2;

• source_language(S, L): source S, language L;

• data_format(S, F): source S, format F;

• update_frequency(S, V): source S, value V;

• last_update(S, D): source S, date D;

• reliable(S, V): source S, value V;

• source(S): source S;

• up(S): source S.



Different kinds of sources are distinguished by their type of organization (com- 
mercial or public) and by the type of data access provided (queryable, down- 
loadable, or both). Besides source language and data format (XML, relational, 
HTML, text, or other), the frequency of data update (low, medium, or high), 
the date of the last update, or the reliability of a source (low, medium, or high) 
may be criteria for source selection. Finally, source and up are used to identify 
sources and to express that a source is currently accessible, respectively. 
As already pointed out, the above predicates are just a rudiment of a vocabulary
for source description, and we are far from claiming that they capture all
aspects or that they capture each one in sufficient detail or granularity (like,
e.g., the three-valued scale used). However, the user or administrator has the
possibility to introduce further predicates and define them in the
source-description program $\Pi_{sd}$. Note that $\Pi_{sd}$ can take advantage
of default rules to handle incomplete information, e.g., that a source is
accessible by default, or that the language of text items is English.

We assume here furthermore that detailed source descriptions are edited by an 
administrator of an overall information system hosting the considered selection 
process. This does not preclude that a preliminary or partial description is cre- 
ated automatically, addressing aspects such as source language, type, data format, 
etc., nor that the information system is open for new sources entering it, adver- 
tising their description to a source registration. However, a number of aspects for 
selection, such as coverage, specialization, or relevance, might be difficult
to assess automatically and require experience gained from interaction with a
source, as in real-life scenarios (think of different travel agencies offering
flights, for instance). Here, the administrator might bring in such knowledge
initially, and the description might be updated in accordance with new
information obtained, e.g., by performance monitoring and user feedback. For
updating an employed description, approaches such as those discussed by Alferes
et al. (2002) or Eiter et al. (2002a) may be applied. In general, however, this
is a complex and interesting issue that is beyond the scope of this paper.

In concluding, we remark that the proviso that $\Pi_{sd}$ possesses a unique
answer set can be ensured, e.g., by requiring (local) stratification of
$\Pi_{sd}$, or by the condition that its well-founded model is total. In
principle, the case of multiple answer sets of $\Pi_{sd}$ could be admitted as
well, which would give rise to different scenarios that could be handled in
different ways, e.g., by adhering to a credulous or skeptical reasoning
principle, according to which the different scenarios are considered on a par
or only selections common to all scenarios are retained, or by a
preference-based approach which discriminates between the different scenarios.
However, we do not elaborate further on this issue.



6 Source selection 

We now introduce the central part of our architecture, viz. source-selection
programs. Basically, a source-selection program is a prioritized logic program
($\Pi_{sel}$, $<_u$) having four parts: (i) a core unit $\Pi^{c}_{sel}$,
containing the actual source-selection rules, (ii) a set $\Pi^{aux}_{sel}$ of
auxiliary rules, (iii) an order relation $<_u$ defined over members of
$\Pi^{c}_{sel}$, and (iv) an optimization part $\Pi^{o}_{sel}$, containing weak
constraints.

6.1 Syntax

We first make the vocabulary of source-selection programs formally precise.
Definition 2

A source-selection vocabulary, $\mathcal{A}_{sel}$, consists of the following
pairwise disjoint categories:

(i) function-free vocabularies $\mathcal{A}_{qd}$, $\mathcal{A}_{sd}$, and
$\mathcal{A}_{dom}$, referred to as the query-description vocabulary, the
source-description vocabulary, and the domain-theory vocabulary of
$\mathcal{A}_{sel}$, respectively, where $\mathcal{A}_{qd}$ and
$\mathcal{A}_{sd}$ contain the predicates introduced in the preceding sections;

(ii) the predicate query_source(S, Q), expressing that source S is selected for
evaluating query Q;

(iii) the predicates default_class(O, C, Q) and default_path(O, P, Q); and

(iv) a set $\mathcal{A}_{aux}$ of auxiliary predicates.

Informally, the predicates default_class(O, C, Q) and default_path(O, P, Q) are
projections of cref(O, C, P, Q) and serve to specify a default status for
selection rules depending on context-reference pairs matched in the query Q. For
example, a predicate default_class(O, "Person", Q) in the body of rule r
expresses that r is eligible in case the concept Person occurs in the reference
path O and there is no other rule r' that refers to some CRP (C', P') matched in
Q. These defaults are semantically realized using a suitable rule ordering.

The set of all literals over atoms in $\mathcal{A}_\alpha$, for
$\alpha \in \{qd, sd, dom, aux, sel\}$, is denoted by $Lit_\alpha$.

Definition 3 

Let $\mathcal{A}_{sel}$ be a source-selection vocabulary. A source-selection
program over $\mathcal{A}_{sel}$ is a tuple ($\Pi_{sel}$, $<_u$), where

(i) $\Pi_{sel}$ is a collection of rules over $\mathcal{A}_{sel}$ consisting of
the following parts:

(a) the core unit, $\Pi^{c}_{sel}$, containing rules of form

query_source(S, Q) $\leftarrow$ $L_1, \ldots, L_m$, not $L_{m+1}, \ldots,$ not $L_n$,

(b) a set $\Pi^{aux}_{sel}$ of auxiliary rules of form

$L_0$ $\leftarrow$ $L_1, \ldots, L_m$, not $L_{m+1}, \ldots,$ not $L_n$,

and

(c) an optimization part, $\Pi^{o}_{sel}$, containing weak constraints of form

$\Leftarrow$ $L_1, \ldots, L_m$, not $L_{m+1}, \ldots,$ not $L_n$ $[w : l]$,

where $L_0$ is either a literal from $Lit_{aux}$ or is of form
$\neg$query_source($\cdot$, $\cdot$), $L_i \in Lit_{sel}$ for $1 \le i \le n$,
and $w, l \ge 1$ are integers, and

(ii) $<_u$ is a strict partial order between rules in $\Pi^{c}_{sel}$.

The elements of $<_u$ are called user-defined preferences. If r1 $<_u$ r2, then
r2 is said to have preference over r1.

The rules in the core unit $\Pi^{c}_{sel}$ serve for selecting a source, based
on information from the domain description, the source description, the query
description, and possibly from auxiliary rules. The latter may be used, e.g.,
for evaluating complex conditions. Preferences for source selection can be
expressed in terms of $<_u$. Likewise, the weak constraints in $\Pi^{o}_{sel}$
are used to filter answer sets under quantitative conditions.

By assembling all constituents for source selection into a single compound, we
arrive at the notion of a selection base, as informally described earlier.

Definition 4 

Let $\mathcal{A}_{sel}$ be a source-selection vocabulary. A selection base over
$\mathcal{A}_{sel}$ is a quintuple
$\mathcal{S} = (\Pi_{qa}, \Pi_{sd}, \Pi_{dom}, \Pi_{sel}, <_u)$, consisting of
the query-analysis program $\Pi_{qa}$ over $\mathcal{A}_{qd}$, programs
$\Pi_{sd}$ and $\Pi_{dom}$ over $\mathcal{A}_{sd}$ and $\mathcal{A}_{dom}$,
respectively, and a source-selection program ($\Pi_{sel}$, $<_u$) over
$\mathcal{A}_{sel}$.

Given that the components $\Pi_{qa}$, $\Pi_{sd}$, and $\Pi_{dom}$ are
understood, the source-selection program ($\Pi_{sel}$, $<_u$) in a selection
base $\mathcal{S}$ is the most interesting part, and $\mathcal{S}$ might be
referred to just by this program. Furthermore, we assume in what follows that
the source-selection vocabulary $\mathcal{A}_{sel}$ contains only those
constants actually appearing in the elements of a selection base over
$\mathcal{A}_{sel}$. Thus, we usually leave $\mathcal{A}_{sel}$ implicit.

Example 6 

Consider a simple source-selection program, ($\Pi_{sel}$, $<_u$), for our movie
domain, consisting of the following constituents:

• Source-selection rules:

r1 : query_source(s2, Q) $\leftarrow$ default_class(O, "Person", Q);

r2 : query_source(s1, Q) $\leftarrow$ selects(O, equal, "Hitchcock"),
     cref(O, "Director", "Personalia/LastName", Q);

r3 : query_source(S, Q) $\leftarrow$ default_path(O, "LastName", Q),
     default_class(O, T, Q), accurate(S, T, high).

• Auxiliary rules:

r4 : high_acc(T, Q) $\leftarrow$ cref(O, T, P, Q), accurate(S, T, high);
r5 : high_cov(T, Q) $\leftarrow$ cref(O, T, P, Q), covers(S, T, high).

• Optimization constraints:

c1 : $\Leftarrow$ query_source(S, Q), high_acc(T, Q),
     not accurate(S, T, high) [10 : 1];

c2 : $\Leftarrow$ query_source(S, Q), high_cov(T, Q),
     not covers(S, T, high) [5 : 1].

• User preferences:

r1(Q, _) $<_u$ r3(Q, _, _, _).
Intuitively, r1 advises to choose source s2 if the query involves persons and no
more specific rule is eligible. Rule r2 states to choose source s1 if the query
contains an explicit select on the movie director Hitchcock. Rule r3 demands to
choose a source if, on some query reference path, "LastName" is accessed under
some concept T (with arbitrary intermediate reference path), and the source is
highly accurate for T. Rules r4 and r5 define auxiliary predicates which hold on
concepts T appearing in the query such that some source with high accuracy and
coverage, respectively, for T exists. The weak constraints c1 and c2 state
penalties for choosing a source that does not have high accuracy (assigning
weight 10) or coverage (assigning weight 5) for a concept in the query while
such a source exists. Finally, r1(Q, _) $<_u$ r3(Q, _, _, _) expresses
preference of instances of r3 over r1 on the same query.

6.2 Semantics 

The semantics of a source-selection program ($\Pi_{sel}$, $<_u$) in a selection
base $\mathcal{S} = (\Pi_{qa}, \Pi_{sd}, \Pi_{dom}, \Pi_{sel}, <_u)$ on a query
Q is given by means of a selection answer set of ($\Pi_{sel}$, $<_u$), which is
defined as a preferred answer set of a prioritized ELP
$\mathcal{E}(\mathcal{S}, Q)$ associated with $\mathcal{S}$ and Q. The program
$\mathcal{E}(\mathcal{S}, Q)$ is of the form $(\Pi_Q, <)$, where program
$\Pi_Q$ contains ground instances of rules and constraints in $\Pi_{sel}$, and
further rules ensuring that a single source is selected per query and rules
defining the default-context predicates. The order relation $<$ is formed from
the user preferences $<_u$ and the implicit priorities derived from context
references in the core unit and from auxiliary rules. Thereby, preference
information must be suitably combined, and arising conflicts resolved, which we
do by means of a cautious conflict-elimination policy.

We commence the formal details with the following notation: For any rule
r = H(r) $\leftarrow$ B(r), its defaultization, $r^{\Delta}$, is given by
H(r) $\leftarrow$ B(r), not $\neg$H(r). We assume that user-defined preferences
between rules carry over to their defaultizations.

Definition 5 

Let $\mathcal{S} = (\Pi_{qa}, \Pi_{sd}, \Pi_{dom}, \Pi_{sel}, <_u)$ be a
selection base and Q a query. Then, the program $\Pi_Q$ contains all ground
instances of the rules and constraints in
$\Pi^{aux}_{sel} \cup \Pi^{o}_{sel}$, as well as all ground instances of the
following rules:

(i) the defaultization $r^{\Delta}$ of r, for each $r \in \Pi^{c}_{sel}$;
(ii) the structural rule

$\neg$query_source(S, Q) $\leftarrow$ query_source(S', Q), S $\neq$ S';    (3)

and

(iii) the default-context rules

default_class(O, C, Q) $\leftarrow$ cref(O, C, _, Q),
default_path(O, P, Q) $\leftarrow$ cref(O, _, P, Q).


Defaultization is also known in the literature as the extended version of a
rule (Kowalski and Sadri 1990; Van Nieuwenborgh and Vermeir 2002).
Intuitively, the defaultization makes the selection rules in $\Pi_{sel}$
defeasible with respect to the predicate query_source, the structural rule
enforces that only one source is selected, and the default-context rules define
the two default predicates. Since our language has no function symbols, $\Pi_Q$
is finite, and its size depends on the constants appearing in $\Pi_{qa}$, R(Q),
$\Pi_{sd}$, and $\Pi_{dom}$.
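The defaultization step is a purely syntactic transformation, which the following Python sketch illustrates on rule r2 of Example 6. It is a toy encoding under stated assumptions, not the representation used by the prototype: rules are kept as strings, classical negation is written as a leading "-", and the names `Rule`, `complement`, and `defaultize` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Rule:
    head: str                  # a literal; classical negation is a leading "-"
    pos: Tuple[str, ...] = ()  # body literals that must hold
    neg: Tuple[str, ...] = ()  # body literals under default negation ("not")

def complement(literal: str) -> str:
    """Return the classically negated counterpart of a literal."""
    return literal[1:] if literal.startswith("-") else "-" + literal

def defaultize(rule: Rule) -> Rule:
    """Turn H <- B into H <- B, not -H, leaving head and body otherwise intact."""
    return Rule(rule.head, rule.pos, rule.neg + (complement(rule.head),))

r2 = Rule('query_source(s1,Q)',
          ('selects(O,equal,"Hitchcock")',
           'cref(O,"Director","Personalia/LastName",Q)'))
r2_default = defaultize(r2)
```

The added default literal is what lets the structural rule (3) block a selection rule: once another source has been chosen, the complementary literal is derived and the defaultized rule no longer fires.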

Definition 6 

For $\mathcal{S} = (\Pi_{qa}, \Pi_{sd}, \Pi_{dom}, \Pi_{sel}, <_u)$ and query Q,
we call any answer set of $\Pi_{qa} \cup R(Q) \cup \Pi_{sd} \cup \Pi_{dom}$ a
selection input of $\mathcal{S}$ for Q. The set of all selection inputs of
$\mathcal{S}$ for Q is denoted by $Sel(\mathcal{S}, Q)$. For
$Y \in Sel(\mathcal{S}, Q)$, we define

$Y_{def} = Y \cup \{$default_class(o, c, q), default_path(o, p, q) $\mid$ cref(o, c, p, q) $\in Y\}$.

Note that, in general, a selection base may admit multiple selection inputs for
a query Q. However, in many cases, there may exist only a single selection
input, in particular if the source description $\Pi_{sd}$ and the domain
knowledge $\Pi_{dom}$ have unique answer sets. In our framework, this is
ensured if, e.g., these components are represented by (locally) stratified
programs.
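The extension of a selection input by default-context facts in Definition 6 amounts to projecting every cref fact twice. A minimal Python sketch, assuming facts are encoded as tuples whose first component is the predicate name (an encoding chosen here for illustration; the helper name `with_defaults` is hypothetical):

```python
from typing import Set, Tuple

Fact = Tuple[str, ...]  # hypothetical encoding: (predicate, arg1, arg2, ...)

def with_defaults(selection_input: Set[Fact]) -> Set[Fact]:
    """Extend Y by default_class and default_path facts projected from cref."""
    derived = set()
    for fact in selection_input:
        if fact[0] == "cref":
            _, o, c, p, q = fact
            derived.add(("default_class", o, c, q))  # drops the path P
            derived.add(("default_path", o, p, q))   # drops the concept C
    return selection_input | derived

Y = {("cref", "o3", "Person", "LastName", "q1"),
     ("selects", "o3", "equal", "Hitchcock")}
Y_def = with_defaults(Y)
```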

Definition 7 

Given a selection base
$\mathcal{S} = (\Pi_{qa}, \Pi_{sd}, \Pi_{dom}, \Pi_{sel}, <_u)$ and a query Q, a
rule $r \in \Pi_Q$ is relevant for Q iff there is some
$Y \in Sel(\mathcal{S}, Q)$ such that $B'(r)$ is true in $Y_{def}$, where
$B'(r)$ results from B(r) by deleting each element which does not contain a
predicate symbol from
$\mathcal{A}_{qd} \cup \mathcal{A}_{sd} \cup \mathcal{A}_{dom} \cup \{$default_class, default_path$\}$.

In the sequel, we denote for any binary relation R its transitive closure by
$R^*$.

We continue with the construction of the preference relation $<$, used for
interpreting a source-selection program ($\Pi_{sel}$, $<_u$) relative to a
selection base $\mathcal{S}$ and a query Q in terms of an associated
prioritized logic program $(\Pi_Q, <)$.

Informally, the specification of $<$ depends on the following auxiliary
relations:

• the preference relation $\preceq_c$, taking care of implicit context
priorities;

• the intermediate relation $\trianglelefteq$, representing a direct
combination of user-defined preferences with context preferences; and

• the preference relation $\trianglelefteq'$, removing possible conflicts
within the joined relation $\trianglelefteq$ and ensuring transitivity of the
resultant order $<$.

More specifically, the relation $\preceq_c$ is the first step towards $<$,
transforming structural context information into explicit preferences, by
virtue of the following specificity conditions:

• default contexts for concepts are assumed to be more specific than default
contexts for attributes;

• context references are more specific than default contexts; and

• with respect to the same IRP, rules with a longer CRP (C, P) are considered
more specific than rules with a shorter CRP (C', P') (i.e., where P' is a
subpath of P).
The second step in the construction of $<$ is the relation $\trianglelefteq$,
which is just the union of the user preferences $<_u$ and the context
priorities $\preceq_c$. In general, this will not be a strict partial order. To
enforce irreflexivity, we remove all tuples r1 $\trianglelefteq$ r2 lying on a
cycle, resulting in $\trianglelefteq'$. Finally, taking the transitive closure
of $\trianglelefteq'$ yields $<$. The formal definition of relation $<$ is as
follows.

Definition 8

Let $\mathcal{S}$ be a selection base, Q a query, and $\Pi_Q$ as in
Definition 5. For r1, r2 $\in \Pi_Q$, define

(i) r1 $\preceq_c$ r2 iff r1 and r2 are relevant for Q, r1 $\neq$ r2, and one
of (O1)-(O3) holds:

(O1) default_path(o1, p1, q) $\in$ B(r1), and either
cref(o2, t2, p2, q) $\in$ B(r2) or default_class(o2, t2, q) $\in$ B(r2),

(O2) default_class(o1, t1, q) $\in$ B(r1) and cref(o2, t2, p2, q) $\in$ B(r2),

(O3) cref(o, t1, p1, q) $\in$ B(r1), cref(o, t2, p2, q) $\in$ B(r2), and t1/p1
is a subpath of t2/p2,

(ii) r1 $\trianglelefteq$ r2 iff r1 $<_u$ r2 and r1 and r2 are relevant for Q,
or r1 $\preceq_c$ r2, and
(iii) r1 $\trianglelefteq'$ r2 iff r1 $\trianglelefteq$ r2 but not
r2 $\trianglelefteq^*$ r1.

Then, the relation $<$ is given as the transitive closure of
$\trianglelefteq'$.
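Steps (ii) and (iii) of Definition 8, followed by the final transitive closure, can be sketched in a few lines of Python. This is an illustrative sketch rather than the ELP component of the prototype. It assumes that the context priorities of step (i) and the user preferences have already been computed and restricted to relevant rule instances, and it represents each relation as a set of (lower, higher) pairs; the function names and the string names for rule instances are hypothetical.

```python
from itertools import product
from typing import Hashable, Set, Tuple

Pair = Tuple[Hashable, Hashable]  # (lower-priority rule, higher-priority rule)

def transitive_closure(rel: Set[Pair]) -> Set[Pair]:
    """Naive fixpoint computation of the transitive closure of a relation."""
    closure = set(rel)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(tuple(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

def final_order(user: Set[Pair], context: Set[Pair]) -> Set[Pair]:
    joined = user | context                    # step (ii): union of both relations
    reach = transitive_closure(joined)
    # Step (iii): drop every pair that lies on a cycle of the joined relation.
    cautious = {(a, b) for (a, b) in joined if (b, a) not in reach}
    return transitive_closure(cautious)

# Rule instances of the running example (instances of r4 and r5 omitted).
r1a, r1b, r2, r3 = "r1(q1,o2)", "r1(q1,o3)", "r2(q1,o3)", 'r3(q1,s1,o3,"D")'
context = {(r1a, r2), (r1b, r2), (r3, r2), (r3, r1a), (r3, r1b)}
user = {(r1a, r3), (r1b, r3)}
order = final_order(user, context)
```

On this input, the conflicting pairs between the instances of r1 and r3 are discarded in both directions, so that only the three pairs ranking r2(q1, o3) on top remain.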

Example 7 

Reconsider ($\Pi_{sel}$, $<_u$) from Example 6. Suppose the domain ontology
contains the concepts "MovieDB", "Actor", "Movie", "Director", and "Person", and
that "Personalia" and "Person" are synonymous. Assume further that the query of
the running example (represented by q1) has a unique selection input Y,
containing the source-description facts

accurate(s1, "Director", high), covers(s2, "Person", high), and reliable(s3, low),

together with the following facts resulting from the query description and the
default-context rules:

cref(o2, "Person", "FirstName", q1),
cref(o2, "Director", "Personalia/FirstName", q1),
cref(o3, "Person", "LastName", q1),
cref(o3, "Director", "Personalia/LastName", q1),
selects(o3, equal, "Hitchcock"),
default_class(o2, "Person", q1),
default_class(o3, "Person", q1),
default_class(o3, "Director", q1),
default_path(o3, "LastName", q1).

These elements are exactly those contributing to relevant instances of $\Pi_Q$.
The relevant instances of r1, r2, and r3 are given by the ground rules
r1(q1, o2), r1(q1, o3), r2(q1, o3), and r3(q1, s1, o3, "D") (for brevity, we
write "D" for "Director" here and in the remainder of this example).
Intuitively, we expect r2(q1, o3) to have highest priority among these rule
instances, since the bodies of the instances of r1 and r3 contain default
predicates while r2 references a specific context. Actually, the order relation
$<$ includes, for the relevant instances of r1, r2, and r3, the pairs
r1(q1, o2) $<$ r2(q1, o3), r1(q1, o3) $<$ r2(q1, o3), and
r3(q1, s1, o3, "D") $<$ r2(q1, o3).

Note that both r4 and r5 have two relevant instances. However, they do not
influence the above rule ordering. Informally, they are either unrelated to or
"ranked between" r2(q1, o3) and the relevant instances of r1 and r3 (since the
cref predicates of r4 and r5 refer to the same context as the context referenced
in the body of r2, or to a subpath of such a context). Hence, the relevant
instance of r2 has highest priority.

As for r1 and r3, the auxiliary relation $\trianglelefteq$ contains two further
structural priorities, namely r3(q1, s1, o3, "D") $\preceq_c$ r1(q1, o2) and
r3(q1, s1, o3, "D") $\preceq_c$ r1(q1, o3). They are in conflict with the user
preferences r1(q1, o2) $<_u$ r3(q1, s1, o3, "D") and
r1(q1, o3) $<_u$ r3(q1, s1, o3, "D"), respectively. This is resolved in the
resultant relation $<$ by removing these preferences.

Note that, in Definition 8, the final order $<$ enforces a cautious
conflict-resolution strategy, in the sense that it remains "agnostic" with
respect to priority information causing conflicts. Alternative definitions of
$\trianglelefteq'$, such as removal of a minimal cutset eliminating all cycles
in $\trianglelefteq$, may be considered as well; however, this may lead to a
nondeterministic choice since, in general, multiple such cutsets exist.
Different choices lead to different orders $<$, which may lead to different
results of the source-selection program. Thus, unless a well-defined specific
minimal cutset is singled out, by virtue of preference conflicts, the result of
the source-selection process might not be deterministic. Furthermore, an
extended logic program component computing a final order based on minimal
cutsets is more involved than a component computing the relations in
Definition 8.

Combining Definitions 5 and 8, we obtain the translation
$\mathcal{E}(\cdot, \cdot)$ as follows:

Definition 9

Let $\mathcal{S}$ be a selection base and Q a query. Then, the evaluation
$\mathcal{E}(\mathcal{S}, Q)$ of $\mathcal{S}$ with respect to Q is given by
the prioritized logic program $(\Pi_Q, <)$, where $\Pi_Q$ and $<$ are as in
Definitions 5 and 8, respectively.

Selection answer sets of source-selection programs are then obtained as follows:

Definition 10

Let $\mathcal{S} = (\Pi_{qa}, \Pi_{sd}, \Pi_{dom}, \Pi_{sel}, <_u)$ be a
selection base, Q a query, and $\mathcal{E}(\mathcal{S}, Q) = (\Pi_Q, <)$ the
evaluation of $\mathcal{S}$ with respect to Q. Then, $X \subseteq Lit_{sel}$ is
a selection answer set of ($\Pi_{sel}$, $<_u$) for Q with respect to
$\mathcal{S}$ iff X is a preferred answer set of the prioritized logic program
$(\Pi_Q \cup Y, <)$, for some $Y \in Sel(\mathcal{S}, Q)$.

A source s is selected for Q iff query_source(s, q) belongs to some selection
answer set of ($\Pi_{sel}$, $<_u$) for Q (with respect to $\mathcal{S}$), where
the constant q represents Q.

Example 8 

In our running example, ($\Pi_{sel}$, $<_u$) has a unique selection answer set
X with respect to $\mathcal{S}$ for the query q1 introduced earlier. It contains
query_source(s1, q1), which is derived from the core rule r2(q1, o3), having the
highest priority among the applicable rules, leading to a single preferred
answer set for the weak-constraint free part of $\Pi_{sel}$. If we replace,
e.g., r1 by the rule

query_source(s2, Q) $\leftarrow$ cref(O, "Person", P, Q)

and adapt the corresponding user preference to
r1(Q, _, _) $<_u$ r3(Q, _, _, _), then the weak-constraint free part of
$\Pi_{sel}$ has two preferred answer sets: one, X1, is identical to X (where
applying r2(q1, o3) is preferred to applying r1(q1, o3), given that
r1(q1, o3) $<$ r2(q1, o3)); in the other answer set, X2, the rule r1(q1, o2) is
applied and query_source(s2, q1) is derived. Informally, the replacement
removes the preference of r2(q1, o3) over r1(q1, o2), since the corresponding
cref predicates refer to different contexts (".../FirstName" and
".../LastName", respectively). Thus, r1(q1, o2) has maximal preference like
r2(q1, o3).

Given that X1 has weight 5, caused by violation of c2(s1, q1, "Person"), but X2
has weight 10, caused by violation of c1(s2, q1, "Director"), X1 is the
selection answer set of ($\Pi_{sel}$, $<_u$) for Q.
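The two weights quoted in Example 8 can be re-derived by evaluating the weak constraints c1 and c2 of Example 6 against the source-description facts of Example 7. The Python sketch below does this by hand rather than with an answer-set solver. It assumes that only the facts accurate(s1, "Director", high) and covers(s2, "Person", high) are rated high, that "Person" and "Director" are the concepts occurring in cref facts for q1, and that summing weights suffices because both constraints sit on the same priority level; all variable and function names are illustrative.

```python
# Source-description facts with value "high" (taken from Example 7).
accurate_high = {("s1", "Director")}     # accurate(S, T, high)
covers_high = {("s2", "Person")}         # covers(S, T, high)
query_concepts = {"Person", "Director"}  # concepts T with some cref(O, T, P, q1)

def penalty(source: str) -> int:
    """Sum the weights of the violated instances of c1 and c2 for one source."""
    # high_acc / high_cov: query concepts for which some source is rated high.
    high_acc = {t for (_, t) in accurate_high if t in query_concepts}
    high_cov = {t for (_, t) in covers_high if t in query_concepts}
    cost = 0
    cost += 10 * sum((source, t) not in accurate_high for t in high_acc)  # c1
    cost += 5 * sum((source, t) not in covers_high for t in high_cov)     # c2
    return cost

weights = {s: penalty(s) for s in ("s1", "s2")}
best = min(weights, key=weights.get)
```

Selecting s1 violates only c2 (no high coverage for "Person"), while selecting s2 violates only c1 (no high accuracy for "Director"), which reproduces the weights 5 and 10 and the choice of X1.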

6.3 Properties 

In this section, we discuss some basic properties of our framework. 

The first property links our evaluation method of source-selection programs to
the usual semantics of prioritized logic programs. For this purpose, we
introduce the following concept: Given logic programs $\Pi_1$ and $\Pi_2$, we
say that $\Pi_1$ is independent of $\Pi_2$ iff each predicate symbol occurring
in some rule head of $\Pi_2$ does not occur in $\Pi_1$. Intuitively, if $\Pi_1$
is independent of $\Pi_2$, then $\Pi_1$ may serve as an "input" for $\Pi_2$.
This idea is made precise by the following proposition, which is an immediate
consequence of results due to Eiter et al. (1997) and Lifschitz and Turner
(1994):

Proposition 1

Let $\Pi_1$ and $\Pi_2$ be two extended logic programs, possibly containing
weak constraints, and let X be a set of ground literals. If $\Pi_1$ is
independent of $\Pi_2$, then X is an answer set of $\Pi_1 \cup \Pi_2$ iff there
is some answer set Y of $\Pi_1$ such that X is an answer set of
$\Pi_2 \cup Y$.

Now, taking the specific structure of our source-selection architecture into ac- 
count, we obtain the following characterization. 

Theorem 1 

Suppose $\mathcal{S} = (\Pi_{qa}, \Pi_{sd}, \Pi_{dom}, \Pi_{sel}, <_u)$ is a
selection base and Q a query. Let $\mathcal{E}(\mathcal{S}, Q) = (\Pi_Q, <)$
and $\Pi_{\mathcal{S}}(Q) = \Pi_{qa} \cup R(Q) \cup \Pi_{dom} \cup \Pi_{sd} \cup \Pi_Q$.
Then, X is a selection answer set of ($\Pi_{sel}$, $<_u$) for Q with respect to
$\mathcal{S}$ iff X is a preferred answer set of $(\Pi_{\mathcal{S}}(Q), <)$.

Proof

Let $\Pi_0$ denote the program
$\Pi_{qa} \cup R(Q) \cup \Pi_{sd} \cup \Pi_{dom}$. Recall that X is a selection
answer set of ($\Pi_{sel}$, $<_u$) for Q with respect to $\mathcal{S}$ iff X is
a preferred answer set of $(\Pi_Q \cup Y, <)$, for some answer set Y of
$\Pi_0$. Since the predicate symbols occurring in the heads of rules in a
source-selection program do not occur in rules from the query description, the
source description, or the domain theory, we obviously have that $\Pi_0$ is
independent of $\Pi_Q$. Moreover, it holds that X is a preferred answer set of
$(\Pi_Q \cup Y, <)$ only if X is an answer set of $\Pi_Q \cup Y$. Hence,
applying Proposition 1, we have that X is an answer set of $\Pi_Q \cup Y$, for
some answer set Y of $\Pi_0$, iff X is an answer set of $\Pi_Q \cup \Pi_0$.
From this, the assertion of the theorem is an immediate consequence. □

We remark that, from a logic programming point of view, Theorem 1 might seem to
be a more natural definition of selection answer sets. However, our approach is
motivated by providing a high-level means for specifying source-selection
problems, which is accomplished by decomposition. Note, in particular, that a
user will only need to specify the relation $<_u$ as opposed to $<$. Hence, the
property of Theorem 1 should rather be understood as a possibility to "compile"
a selection base and selection inputs with respect to a query into a single
logic program.

Strengthening Theorem 1, the construction of $\mathcal{E}(\mathcal{S}, Q)$ can
itself be realized in terms of a single logic program of the form
$\Pi_{\mathcal{S}}(Q) \cup \Pi_{obj}(Q)$ over an extended vocabulary, by
describing preference relations directly at the object level, such that each
answer set encodes the priority relation $<$ and is a preferred answer set of
$(\Pi_{\mathcal{S}}(Q) \cup \Pi_{obj}(Q), <)$ if and only if its restriction to
$Lit_{sel}$ is a selection answer set of ($\Pi_{sel}$, $<_u$) for Q with
respect to $\mathcal{S}$. More details about this property are given in
Appendix D.

Concerning the computational complexity of source selection, we note that,
given a query Q and the grounding of the program $\Pi_{\mathcal{S}}(Q)$ for a
selection base $\mathcal{S}$ as in Theorem 1, deciding whether
($\Pi_{sel}$, $<_u$) has some selection answer set for Q is NP-complete (since
the grounding of $\Pi_{obj}(Q)$ can be constructed in polynomial time from the
grounding of $\Pi_{\mathcal{S}}(Q)$), and computing any such selection answer
set is complete for $\mathrm{FP}^{\mathrm{NP}}$, which is the class of all
problems solvable in polynomial time with an NP oracle. However, for a fixed
selection base and small query size (which is a common assumption for
databases), the problems are solvable in polynomial time (cf. again Appendix D
for more details about the complexity of source-selection programs).

One of the desiderata of our approach is that each answer set selects at most one 
source, for any query Q. The following result states that this property is indeed 
fulfilled. 

Theorem 2

Let $X$ be a selection answer set of $(\Pi_{sel}, <_u)$ for query $Q$ with respect to $S$. Then,
for any constant $q$, it holds that

\[ |\{ s \mid \mathit{query\_source}(s,q) \in X \}| \le 1. \]

Proof

The presence of the structural rule in the evaluation program $\Pi_Q$ enforces that,
whenever $X$ contains two distinct ground atoms $\mathit{query\_source}(s,q)$ and
$\mathit{query\_source}(s',q)$, $X$ must be inconsistent, and thus $X$ violates the consistency
criterion of answer sets. □
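
The property stated in Theorem 2 can be illustrated on a concrete set of ground atoms. The following Python fragment is only a minimal sketch: the encoding of atoms as tuples and the function names `selected_sources` and `selects_at_most_one` are assumptions made for illustration and are not part of the implementation described in this paper.

```python
from collections import defaultdict


def selected_sources(answer_set):
    """Group the sources s by query q over all atoms query_source(s, q)."""
    by_query = defaultdict(set)
    for atom in answer_set:
        # Atoms are encoded as tuples (predicate, arg1, arg2, ...); this
        # encoding is an assumption of the sketch.
        if atom[0] == "query_source":
            _, source, query = atom
            by_query[query].add(source)
    return dict(by_query)


def selects_at_most_one(answer_set):
    """True iff |{s | query_source(s, q) in X}| <= 1 holds for every q."""
    return all(len(srcs) <= 1 for srcs in selected_sources(answer_set).values())


# Hypothetical answer set selecting exactly one source for query q1.
X = {("source", "s_Hitchcock"), ("query_source", "s_Hitchcock", "q1")}
```

Adding a second atom such as `("query_source", "s_KellyGrant", "q1")` to `X` makes `selects_at_most_one` return `False`, which corresponds to the situation excluded by the structural rule.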

Lastly, the following result concerns the order of application of source-selection 
rules, stating that source selection is blocked in terms of priorities as desired. 

Theorem 3

Let $X$ be a selection answer set of $(\Pi_{sel}, <_u)$ for query $Q$ with respect to $S$, and let
$r^{\Delta} \in \Pi_Q$ be the defaultization of some rule $r$ belonging to the grounding of $\Pi^{c}_{sel}$
for $Q$ with respect to $S$. Suppose that $B(r)$ is true in $X$ but $H(r) \notin X$. Then, there
is some $r' \in \Pi_Q$ such that

(i) either $r'$ belongs to the grounding of $\Pi^{aux}_{sel}$ for $Q$ with respect to $S$ and
$H(r') = \neg H(r)$, or $r'$ is the defaultization of a rule from the grounding of $\Pi^{c}_{sel}$ for $Q$
with respect to $S$,

(ii) $B(r')$ and $H(r')$ are true in $X$, and

(iii) either $r^{\Delta}$ and $r'$ are incompatible with respect to $<$, or else $r^{\Delta} < r'$ holds.

Proof

Given that $X$ is a selection answer set of $(\Pi_{sel}, <_u)$ for query $Q$ with respect to $S$, we
have that $X$ is a preferred answer set of the prioritized logic program $(\Pi_Q \cup Y, <)$,
where $Y$ is some selection input of $S$ for query $Q$, and, a fortiori, that $X$ is an
answer set of $\Pi_Q \cup Y$. From the latter and the hypothesis that $H(r) \notin X$, it follows
that $r^{\Delta} = H(r) \leftarrow B(r), \mathit{not}\ \neg H(r)$ is not a member of $GR(X, \Pi_Q \cup Y)$. Hence, in
view of the assumption that $B(r)$ is true in $X$, we get that $\neg H(r) \in X$ must hold.

Since $X$ is a preferred answer set of $(\Pi_Q \cup Y, <)$, there is some enumeration
$\langle r_i \rangle_{i \in I}$ of $GR(X, \Pi_Q \cup Y)$ such that Conditions (P1)-(P3) hold (cf. the definition
of preferred answer sets). We take $r' = r_\ell$, where $\ell$ is as follows. Given that
$\neg H(r) \in X$, there is a smallest index $i_0 \in I$ such that $r_{i_0} \in GR(X, \Pi_Q \cup Y)$ and
$H(r_{i_0}) = \neg H(r)$. If $r_{i_0}$ belongs to the grounding of $\Pi^{aux}_{sel}$ for $Q$ with respect to $S$,
then $\ell = i_0$. Otherwise, by the syntactic form of a source-selection program, $r_{i_0}$ must be
an instance of the structural rule. By (P1)-(P3), the defaultization $\hat{r}$ of a rule from the
grounding of $\Pi^{c}_{sel}$ for $Q$ with respect to $S$ must exist such that
$\hat{r} = r_{j_0} \in GR(X, \Pi_Q \cup Y)$ and $j_0 < i_0$. In this case, $\ell = j_0$.

We show that $r'$ satisfies Conditions (i)-(iii). Clearly, Condition (i) is satisfied.
Furthermore, Condition (ii) is an immediate consequence of the fact that
$r' \in GR(X, \Pi_Q \cup Y)$. It remains to show that Condition (iii) holds.

Towards a contradiction, assume that $r' < r^{\Delta}$. Since $r' \in GR(X, \Pi_Q \cup Y)$ and
$r^{\Delta} \notin GR(X, \Pi_Q \cup Y)$, from Condition (P3) we get that
$B^{-}(r^{\Delta}) \cap \{H(r_k) \mid k < \ell\} \neq \emptyset$, as $B^{+}(r) \subseteq X$ and $B^{+}(r^{\Delta}) = B^{+}(r)$.
Now, obviously $\{H(r_k) \mid k < \ell\} \subseteq X$. Moreover, since $B^{-}(r) \cap X = \emptyset$ and
$B^{-}(r^{\Delta}) = B^{-}(r) \cup \{\neg H(r)\}$, we obtain that $\neg H(r) \in \{H(r_k) \mid k < \ell\}$. Hence,
there must be some $k_0 < \ell$ such that $r_{k_0} \in GR(X, \Pi_Q \cup Y)$ and $H(r_{k_0}) = \neg H(r)$.
But this contradicts the condition that $i_0$ ($\geq \ell$) is the smallest index $i$ such that
$r_i \in GR(X, \Pi_Q \cup Y)$ and $H(r_i) = \neg H(r)$. Hence, we either have that $r'$ and $r^{\Delta}$
are incompatible with respect to $<$, or $r^{\Delta} < r'$ must hold. □

6.4 Extended source selection

The semantics of source-selection programs we defined so far aims at selecting at
most one source. We can easily modify this definition, however, to accommodate
also the selection of multiple sources at a time. To this end, we only have to modify
the structural rule in the definition of $\Pi_Q$ appropriately.

For example, using language elements provided by the DLV system (Leone et al. 2006;
Faber et al. 2004), the simultaneous selection of up to a given number $k$ of sources
can be accomplished by replacing the structural rule with the following rules:

false ← not false, query(Q), max_sources(K),
        #count{S' : query_source(S', Q)} > K,

¬query_source(S, Q) ← source(S), query(Q), max_sources(K),
        1 <= #count{S' : query_source(S', Q)} <= K,
        not query_source(S, Q),

where max_sources(K) holds for K = k. Here, #count{S' : query_source(S', Q)}
is an aggregate expression which singles out the number of all sources S' for which
an instance of query_source(S', Q) is in the answer set, and ">" and "<=" are
comparison built-ins. This modification can also be expressed with (ordinary) ELPs
as introduced earlier, but is then more involved.
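
The effect of the two rules on a given candidate set of atoms can be mimicked procedurally. The Python fragment below is a sketch under stated assumptions: atoms are encoded as tuples, the function names are hypothetical, and the code only checks a candidate set against the intended reading of the rules; it does not compute answer sets and does not call DLV.

```python
def chosen_sources(atoms, query):
    """All S' such that query_source(S', query) is in the given set of atoms."""
    return {a[1] for a in atoms if a[0] == "query_source" and a[2] == query}


def respects_bound(atoms, query, k):
    """First rule, read as a constraint: #count{...} > K must not hold."""
    return len(chosen_sources(atoms, query)) <= k


def negated_sources(atoms, sources, query, k):
    """Second rule: if between 1 and k sources are selected, every source S
    that is not selected obtains the literal ¬query_source(S, query)."""
    chosen = chosen_sources(atoms, query)
    if not 1 <= len(chosen) <= k:
        return set()
    return {s for s in sources if s not in chosen}


# Hypothetical example data.
SOURCES = {"s_RM", "s_HC", "s_KG"}
ATOMS = {("query_source", "s_HC", "q1"), ("query_source", "s_KG", "q1")}
```

With `k = 2`, the set `ATOMS` respects the bound and `s_RM` is the only source that is explicitly ruled out; with `k = 1`, the bound is violated and the candidate is rejected.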

The setting of selecting a "best" source with a single selection result can be easily 
generalized to a setting with multiple, ranked selection results — in particular, to the 
computation of all outcomes with a cost valuation within a given distance d to a 
given value, as well as to the computation of the k best outcomes, for a given integer
k, akin to range queries and k-nearest neighbor queries, respectively, in information
retrieval. Such ranked computations can be orthogonally combined with the type of 
selection outcome (i.e., single source vs. up to a number of sources). Furthermore, 
they can be easily accomplished using the features of the underlying DLV system. 
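
The two kinds of ranked computation mentioned above can be sketched over a list of already evaluated outcomes. In the following Python fragment, the representation of an outcome as a (name, cost) pair and the function names are assumptions made for illustration; in the actual system, such computations would rely on the features of DLV.

```python
import heapq


def within_distance(outcomes, value, d):
    """Range-query analogue: outcomes whose cost lies within distance d of value."""
    return [(name, cost) for name, cost in outcomes if abs(cost - value) <= d]


def k_best(outcomes, k):
    """k-nearest-neighbor analogue: the k outcomes with the lowest cost."""
    return heapq.nsmallest(k, outcomes, key=lambda item: item[1])


# Hypothetical outcomes with hypothetical cost valuations.
OUTCOMES = [("A", 0), ("B", 2), ("C", 1), ("D", 5)]
```

Both functions are independent of whether an outcome stands for a single source or for a set of up to k sources, which mirrors the orthogonality noted above.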

7 Implementation and application 

7. 1 Implementation 

We have implemented our source-selection approach on top of the DLV system
(Leone et al. 2006) and its front-end plp (Delgrande et al. 2001) for prioritized logic
programs.* The evaluation of source-selection programs proceeds in three steps:
(i) the set of all selection inputs for a query $Q$ is computed from $R(Q)$, $\Pi_{qa}$, $\Pi_{sd}$,
and $\Pi_{dom}$ using DLV (cf. Appendix D); (ii) a call to DLV calculates
the priority relation $<$ from the set of selection inputs and $\Pi_{sel}$; and (iii) the answer
sets of $(\Pi_Q, <)$ are determined by employing plp and DLV. Note that this three-step
approach might appear to be overly complex, given that computing a selection
answer set is feasible in polynomial time with an NP oracle (see Section 6.3 and
Appendix D), and one might wonder why DLV (which can handle
$\Sigma_2^P$-complete problems) is called several times. The reason for proceeding in this
fashion is that it greatly improves performance since, due to the built-in
optimization techniques of DLV, groundings can be kept smaller.

* Details about DLV and plp can also be found at http://www.dlvsystem.com and
http://www.cs.uni-potsdam.de/~torsten/plp, respectively.

Fig. 2. Architecture of a simple agent-based source-selection system

The entire process is implemented as an ECLiPSe Prolog program, which served
as a rapid prototyping language, and is independent of the actual query language.
For XML-QL queries, however, a query parser, written in C++, for generating the
low-level representation $R(Q)$ of a query $Q$ has been developed. A query parser
for SQL queries is also available (Schindlauer 2002), and further languages can be
deployed in the same way.

We have also "agentized" the source-selection system using the IMPACT agent
platform (Subrahmanian et al. 2000), enabling the realization of source-selection
agents which may also issue the execution of XML-QL queries on XML data sources.
A generic agent-based source-selection setup, as implemented in IMPACT, is shown
in Figure 2. Data are stored in XML databases, and queries are posed in an XML
query language such as XML-QL. Some of the databases may be wrapped from
non-XML data sources. A query is handed over to an information agent, which has
to pick one of several databases that comply with the same (universal) schema to
answer the query.

The architecture in Figure 2 is only one of several possible agent-based architectures;
others may be as follows:



• there may be multiple information agents in a system, avoiding a centraliza- 
tion bottleneck; 

• the source-selection capability may be realized not in terms of a special source-
selection agent, but as part of a more powerful mediator agent; or

• the sources may be accessed through specialized wrapper agents, which control 
access and might refuse requests. 



7.2 An application for movie databases 

As an application domain, we considered the area of movie databases, and we have 
built an experimental environment for source selection in this domain, using the 
prototype implementation described above.

7.2.1 Movie sources

We used the Internet Movie Database (IMDb) as the main source for raw data, as
well as the EachMovie Database provided by Compaq Computer Corporation,^ to
generate a suite of XML movie databases. To this end, (parts of) the large databases
were wrapped offline to XML, using a DTD (provided in Appendix A) which we
modeled from a set of relevant movie concepts captured by the Open Directory
Project.^^ The XML databases we constructed are the following:

RandomMovies (RM): This source contains data about numerous movies, ran- 
domly wrapped from the IMDb. Besides title and language information (always 
having value "English"), each item comprises, where available, entries containing 
genre classification, the release date, the running time, review ratings, the names 
of the two main actors, directors, and screenwriters, as well as details about the 
soundtrack (listing, in some cases, the name of the composer of the soundtrack). 

RandomPersons (RP): Like RM, RandomPersons is derived from the IMDb, 
containing randomly wrapped data about numerous actors, directors, screen- 
writers, and some composers. Besides names, person data comprise the date and 
country of birth, and a biography, and may, as for RM, again be incomplete. 

EachMovie (EM): Wrapped from Compaq's EachMovie Database, this source 
stores English movies plus ratings. For most entries, it provides genre information, 
and for half of them a release date (after 1995). It has no information about actors, 
directors, soundtracks, etc., however. 

Hitchcock (HC): Wrapped from the IMDb, this source stores all movies directed 
by Alfred Hitchcock, in the format of RandomMovies but with all involved actors 
listed. For each person, it also contains information (if available) about the date 
and country of birth, and a biography. 

KellyGrant (KG): Similar to HC, this source stores the titles of all movies in
which either Grace Kelly or Cary Grant were actors, as well as the names of all
persons involved.

Horror60 (H60): Being the last of our databases, H60 is a collection of horror
movies from the 1960s, as found in the IMDb. Movie and person data are as
before, but almost no soundtrack or composer information is stored.

The information about these databases is stored in the source-description program
$\Pi_{sd}$, using the predicates introduced earlier. For illustration, we list some
elements of this program, modeling one of the sources, and refer to Eiter et al. (2003)
and Fink (2002) for a detailed account of the complete program $\Pi_{sd}$.

^ These two databases are available at http://www.imdb.org and
http://www.research.compaq.com/SRC/eachmovie, respectively.
^^ See http://dmoz.org

Example 9

For providing information about database KellyGrant, the program $\Pi_{sd}$ contains
the following facts:

source(s_KellyGrant);
up(s_KellyGrant);
data_format(s_KellyGrant, xml);
update_frequency(s_KellyGrant, low);
specialized(s_KellyGrant, "Kelly");
specialized(s_KellyGrant, "Grant");
covers(s_KellyGrant, "Movie", low);
covers(s_KellyGrant, c, high), for c ∈ {"ReleaseDate", "Person", "BirthDate", "Actor"};
covers(s_KellyGrant, fifties, high);
covers(s_KellyGrant, sixties, high);
¬relevant(s_KellyGrant, p), for p ∈ {seventies, eighties, nineties, twothousands}.

Informally, $\Pi_{sd}$ expresses that KellyGrant is an XML source which is (currently)
up and rarely updated. It is specialized in topics "Kelly" and "Grant", and
has high coverage about persons, especially actors, and their birth dates, but provides
low coverage about movies in general. However, it highly covers the release
dates of the stored movies, most of which are from the fifties and sixties. Further
information about KellyGrant is derived from default rules like the ones given
below, stating that English is the default language for all sources:

source_language(S, "English") ← source(S),
        not ¬source_language(S, "English");

¬source_language(S, "English") ← source_language(S, L), L ≠ "English".
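
The interplay of the two default rules can be traced by hand for a single source. The Python fragment below is a minimal sketch of that reading; the function name `source_languages` and the dictionary of explicitly stated languages are assumptions made for illustration, and the code handles only these two rules rather than general answer-set semantics.

```python
def source_languages(source, explicit_languages):
    """Languages derivable for `source` from the two default rules.

    `explicit_languages` maps a source to the set of languages stated for it
    by facts (a hypothetical input format).
    """
    langs = set(explicit_languages.get(source, ()))
    # Second rule: any language other than English refutes English.
    english_refuted = any(lang != "English" for lang in langs)
    if english_refuted and "English" in langs:
        # source_language(S, "English") and its strong negation would both
        # be derived, so no consistent answer set exists.
        raise ValueError("English is both asserted and refuted for " + source)
    # First rule: English holds by default unless it is refuted.
    if not english_refuted:
        langs.add("English")
    return langs
```

For a source without any stated language, the default applies and English is derived; as soon as another language is stated, the default is blocked.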

7.2.2 Domain knowledge 

The ontology part of the domain knowledge, $\Pi_{dom}$, includes the facts

class(O), for O ∈ {"MovieDB", "Movie", "Director", "Actor", "Screenwriter",
        "Composer", "Person", "Soundtrack", "Review"},

as they may be extracted from the XML DTD, and the fact

synonym("Personalia", "Person").

The attributes of the concept "Movie" are given by the facts

class_att("Movie", att), where att ∈ {title, alternativeTitles, genre, releaseDate,
        runningTime, language, review}.

For example, a concrete instance of "Movie" is given by instance(m12, "Movie").
For further details, cf. Eiter et al. (2003) or Fink (2002).

The background part of $\Pi_{dom}$ serves to formalize "common-sense" knowledge
of the application domain, which is an important source of information for the
selection process. This part is usually quite extensive. On the one hand, it contains
rules capturing typical relationships between ontological concepts, and, on the other
hand, it comprises "well-known" instances of these concepts. For space reasons,
we only show a few rules of $\Pi_{dom}$ here. We note in passing that this part also
implements a simple form of reasoning about time, viz. reasoning about decades,
by associating with every year since 1920 its corresponding decade.
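
The association of years with decades can be sketched as a simple lookup. In the Python fragment below, the decade names from "fifties" to "twothousands" are those occurring in the source description, whereas the names for the 1920s to 1940s, the upper bound of 2009, and the function name `decade` are assumptions made for illustration.

```python
# Start year of each decade mapped to its name. The entries for 1920-1940
# are assumed names; the remaining names occur in the source description.
DECADE_NAMES = {
    1920: "twenties", 1930: "thirties", 1940: "forties",
    1950: "fifties", 1960: "sixties", 1970: "seventies",
    1980: "eighties", 1990: "nineties", 2000: "twothousands",
}


def decade(year):
    """Return the decade name of a year from 1920 up to 2009."""
    start = year - year % 10
    if start not in DECADE_NAMES:
        raise ValueError("no decade defined for year %d" % year)
    return DECADE_NAMES[start]
```

For instance, a release year of 1944 is mapped to the forties, so a source that is declared not relevant for that decade can be ruled out by reasoning over decades.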

Example 10 

Some (typical) rules from the background knowledge are:

$s_1$: instance(P, "Director") ← directed(P, M);

$s_2$: involved(P, M) ← directed(P, M);

$s_3$: life_period(P, B, E) ← instance(P, "Person"), not dead(P),
        att_val(P, birthDate, B1), current_year(E),
        calender_year(B1, B);

$s_4$: possible_genre(M, G) ← involved(P, M), default_genre(P, G),
        not defined_genre(M).

Intuitively, rules $s_1$ and $s_2$ infer, from the fact that a person directed a movie,
that the corresponding person is a director and that he or she is involved in the
movie. Rule $s_3$ assigns a life period to a person from his or her birth date, while
rule $s_4$ infers a possible genre for a movie, if an involved person and his or her
default genre are known.
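
How the rules for directors and possible genres interact can be sketched by a small forward derivation. The Python fragment below is an illustration only: the tuple encoding of atoms, the function name, the simplification of default_genre to one genre per person, and all example constants are assumptions, and the negation-as-failure literal is evaluated directly against the given set of movies with a defined genre.

```python
def apply_background_rules(directed, default_genre, defined_genre):
    """Derive atoms along the lines of the first, second, and fourth rule.

    directed:      set of (person, movie) pairs
    default_genre: dict person -> genre (simplified to one genre per person)
    defined_genre: set of movies whose genre is explicitly defined
    """
    derived = set()
    for person, movie in directed:
        derived.add(("instance", person, "Director"))  # first rule
        derived.add(("involved", person, movie))       # second rule
    for atom in list(derived):
        if atom[0] != "involved":
            continue
        _, person, movie = atom
        # Fourth rule: applies only if no genre is defined for the movie.
        if person in default_genre and movie not in defined_genre:
            derived.add(("possible_genre", movie, default_genre[person]))
    return derived


# Hypothetical example facts.
FACTS = apply_background_rules(
    directed={("perHitchcock", "m1")},
    default_genre={"perHitchcock": "thriller"},
    defined_genre=set(),
)
```

If `m1` is added to `defined_genre`, the atom `possible_genre(m1, thriller)` is no longer derived, which reflects the default character of the fourth rule.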

Furthermore, the following facts are representations of specific movie-historic in- 
cidents (actually, they model information about Grace Kelly and the movie "Arsenic 
and Old Lace"): 

instance(perKelly, "Actor");
att_val(perKelly, name, nameKelly);
att_val(perKelly, birthDate, 1929);
att_val(perKelly, dateOfDeath, 1982);
prod_period(perKelly, 1945, 1960);

instance(nameKelly, name);
att_val(nameKelly, firstName, "Grace");
att_val(nameKelly, firstName, "Patricia");
att_val(nameKelly, lastName, "Kelly");

instance(m12, "Movie");
att_val(m12, title, "Arsenic and Old Lace");
att_val(m12, releaseDate, 1944);
acted(perGrant, m12).

7.2.3 Source-selection program

The experimental movie source-selection program fills several pages and is too complex
to be listed and discussed here in detail. Therefore, as before, we only
give an informal description, highlighting the most important aspects, and refer to
Eiter et al. (2003) and Fink (2002) for more details.

Among the source-selection rules, default rules have lowest priority and are used
only in the core part. They make default suggestions for query sources in case no
other core source-selection rule is eligible. Some examples are the following default
rules:

$r_1$: query_source(S, Q) ← default_path(O, P, Q),
        occurs(O, V), specialized(S, P);

$r_2$: query_source(s_RandomMovies, Q) ← default_class(O, "Movie", Q);

$r_3$: query_source(s_RandomPersons, Q) ← default_class(O, "Person", Q).

The first rule is generic, whilst the others are specific. Informally, $r_1$ advises to
query source S if it is specialized for P, where P is some path of a reference in the
query that is compared to some value. For example, suppose P is instantiated with
LastName. If some source is specialized for last names, then it is chosen unless a
source-selection rule with higher priority is applicable. Similarly, the specific rules
$r_2$ and $r_3$ suggest to select RandomMovies or RandomPersons if the query
entails a reference under object "Movie" or "Person", respectively.

Non-default core source-selection rules also appear in either generic or specific
form:

$r_4$: query_source(S, Q) ← source(S), query(Q),
        high_coverage(S, Q);

$r_5$: query_source(S, Q) ← source(S), query(Q), special(S, Q);

$r_6$: query_source(s_Hitchcock, Q) ← cref(O1, "Person", "LastName", Q),
        selects(O1, equal, "Hitchcock"),
        cref(O2, "Person", "FirstName", Q),
        selects(O2, equal, "Alfred").

The generic rules $r_4$ and $r_5$ suggest to query any source that highly covers the
query or is special for it, respectively. The specific rule $r_6$ advises to query the
source Hitchcock if a query selects a person named Alfred Hitchcock. Note that
high_coverage and special are auxiliary predicates, defined by auxiliary rules (see
below).

Since no cref predicate and no default predicates occur in $r_4$ and $r_5$, there is no
(direct) structural precedence between them and rule $r_6$, as well as between $r_1$, $r_2$,
and $r_3$. The following user preferences explicitly establish preferences among them:

$r_1$(_, Q, _, _, _) $<_u$ $r_5$(_, Q);        $r_4$(_, Q) $<_u$ $r_5$(_, Q);
$r_2$(Q, _) $<_u$ $r_5$(_, Q);                 $r_4$(_, Q) $<_u$ $r_6$(Q, _, _);
$r_3$(Q, _) $<_u$ $r_5$(_, Q);                 $r_5$(_, Q) $<_u$ $r_6$(Q, _, _).



Auxiliary rules are used to define auxiliary predicates as well as to filter irrelevant
sources:

$a_1$: special(S, Q) ← special_topic(S, Q, T);

$a_2$: special_topic(S, Q, T) ← inferred_topic(Q, T), specialized(S, T);

$a_3$: inferred_topic(Q, T) ← matchingMovie(Q, M), involved(P, M),
        att_val(P, name, N), att_val(N, lastName, T);

$a_4$: ¬query_source(S, Q) ← irrelevant(S, Q);

$a_5$: irrelevant(S, Q) ← cref(O, "MovieDB",
        "Movie/ReleaseDate/Date", Q),
        selects(O, equal, V), calender_year(V, Y),
        decade(Y, D), ¬relevant(S, D).

Informally, rule $a_1$ states that a source is special for a query if a topic associated
with the query exists for which it is special. Rule $a_2$ expresses that one way to
associate a topic with a query is to infer a topic, as realized, e.g., in terms of rule
$a_3$. Hence, if the query accesses a movie that is known and T is the last name of a
person involved in it, then a source is concluded to be special for that query if it is
specialized for T. Rule $a_4$ states that a source must not be queried if it is irrelevant
for a query; in view of rule $a_5$, this is the case if the source is not relevant for the
decade in which the movie has been released.
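
The chain from a known movie to a special source can be sketched as follows. The Python fragment is an illustration only: the function name and input formats are hypothetical, and the example facts (in particular the last name stored for perGrant and the specialization of s_Hitchcock) are assumptions chosen to resemble the KellyGrant scenario.

```python
def special_sources(matching_movies, involved, last_name, specialized):
    """Sources that are special for a query, following the topic-inference chain.

    matching_movies: set of movies accessed by the query
    involved:        set of (person, movie) pairs
    last_name:       dict person -> last name
    specialized:     set of (source, topic) pairs
    """
    # Topic inference: last names of persons involved in a matching movie.
    topics = {
        last_name[person]
        for person, movie in involved
        if movie in matching_movies and person in last_name
    }
    # A source is special if it is specialized for some inferred topic.
    return {source for source, topic in specialized if topic in topics}


# Hypothetical facts for a query accessing movie m12.
SPECIAL = special_sources(
    matching_movies={"m12"},
    involved={("perGrant", "m12")},
    last_name={"perGrant": "Grant"},
    specialized={
        ("s_KellyGrant", "Kelly"),
        ("s_KellyGrant", "Grant"),
        ("s_Hitchcock", "Hitchcock"),
    },
)
```

Under these assumed facts, only `s_KellyGrant` is found to be special, since "Grant" is inferred as a topic of the query although it does not occur in the query itself.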

Finally, the quantitative part of the source-selection program has weak constraints
like the following:

$w_1$: ⇐ query_source(S, Q), default_class(O, T, Q),
        constructs(O, C, P), covers(S1, T, high), not covers(S, T, high) [3:1];

$w_2$: ⇐ query_source(S, Q), high_covered_topic(S1, Q, T),
        decade_name(T), not high_covered_topic(S, Q, T) [1:1].

Intuitively, $w_1$ assigns a penalty of 3 per concept T that is asked (resp., constructed)
by the query to any answer set which selects a source that does not highly cover
T while a source highly covering T exists. Similarly, $w_2$ assigns a penalty of 1 per
decade that is associated with the query to any answer set which selects a source
which does not highly cover this decade while some other highly covering source
exists.
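
The informal reading of the decade-based weak constraint can be turned into a small cost computation. The Python fragment below is a sketch only: it follows the intuitive description given above (one penalty per decade) rather than the exact counting of ground instances in DLV, and the function names as well as the coverage data are assumptions. In particular, which single decade HC and H60 cover highly is assumed here for illustration.

```python
def decade_penalty(selected, query_decades, high_cover):
    """Penalty 1 per query decade that the selected source does not highly
    cover although some source highly covers it."""
    cost = 0
    for dec in query_decades:
        covered_somewhere = any(dec in decs for decs in high_cover.values())
        if covered_somewhere and dec not in high_cover.get(selected, set()):
            cost += 1
    return cost


def best_sources(candidates, query_decades, high_cover):
    """Candidates with minimal penalty, together with all penalties."""
    costs = {s: decade_penalty(s, query_decades, high_cover) for s in candidates}
    minimum = min(costs.values())
    return sorted(s for s, c in costs.items() if c == minimum), costs


# Assumed coverage data; only the entry for KG follows the source description.
HIGH_COVER = {
    "KG": {"fifties", "sixties"},
    "HC": {"fifties"},
    "H60": {"sixties"},
}
BEST, COSTS = best_sources(["H60", "HC", "KG"], {"fifties", "sixties"}, HIGH_COVER)
```

With this data, KG receives penalty 0 and the other two candidates receive penalty 1 each, so KG is the only optimal choice.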



7.3 Experiments 

We tested the above movie-application scenario by means of a number of natural
user queries. More specifically, our tests involved 18 queries, some of which are the
following (for the complete list of queries, cf. Eiter et al. (2003) or Fink (2002)):

$q_1$: Which movies were directed by Alfred Hitchcock?
$q_2$: In which movies, directed by Josef von Sternberg, did Marlene Dietrich act?
$q_3$: In which year has the movie "Arsenic and Old Lace" been released?
$q_4$: In which movies, directed by Alfred Hitchcock, did Marlene Dietrich act?
$q_5$: In which film noirs did Marilyn Monroe act?
$q_6$: In which movies did Laurel and Hardy act in 1940?
$q_7$: Which movies in which Frank Sinatra appeared have a soundtrack composed
      by Elmer Bernstein?
$q_8$: When was James Dean born?

Table 1. Experimental Results for the Movie Application

Query       q1   q2                q3   q4   q5           q6           q7                q8
Candidates  HC   RM, H60, HC, KG   KG   HC   RM, HC, KG   RM, HC, KG   RM, H60, HC, KG   RP
Best        HC   RM                KG   HC   RM           RM           RM                RP


The formulation of these queries in XML-QL is straightforward (as a matter of
fact, $q_1$ is expressed by an XML-QL query given in an earlier example; for the
formulation of all queries in XML-QL, cf. Eiter et al. (2003) or Fink (2002)).

Source selection for the considered queries was performed employing the movie
databases described above as well as variants thereof. Each run took from a
couple of seconds up to tens of seconds, which is due to the size of the programs involved.
However, performance was not a central issue here. Since our implementation and
the tools used are unoptimized, there is a large potential for performance improvements.
Also, the underlying solvers might gain efficiency in future releases.



7.3.1 Results 

The results of the source selection process for $q_1$-$q_8$, using the above source
descriptions, are shown in Table 1. Note that, by the semantics of source-selection
programs, per selection answer set and query, a single source is chosen. Thus, query 
decomposition is not considered here, although our method for computing a query 
description allows for it in principle. The entries show the sources which are se- 
lected by the different answer sets, where the labels "Candidates" and "Best" refer 
to selection with optimization part dropped (i.e., qualitative selection only) and 
enabled, respectively. 

The results can be informally explained as follows. For $q_1$, a specific core
source-selection rule, $r_6$, which has highest preference, fires and HC is chosen, as expected.

For $q_2$, there is some background knowledge about Marlene Dietrich, but no
source can be found as being special for this query, while generic default
source-selection rules trigger for all sources. Nonetheless, RP and EM are recognized as
being irrelevant for $q_2$ and eventually discarded: $q_2$ asks for (resp., ranges over)
concepts these sources are not relevant for (viz. "Movie" and "Person", respectively).
The best source among the candidates RM, H60, HC, and KG is RM, since it is
the only one highly covering the concept asked for.

For $q_3$, since "Arsenic and Old Lace" is in the background knowledge (cf. above),
and since Cary Grant acted in it, we would expect KG to be queried. Indeed, this
is what actually happens. It is not a specific core source-selection rule that triggers
the selection (Cary Grant does not explicitly appear in the query); rather, Grant
is inferred as a query topic from the background knowledge and, thus, the generic
core selection rule suggesting to query KG has highest priority.

Query $q_4$ is a refinement of $q_1$; the same specific core source-selection rule, $r_6$, as
for $q_1$ triggers.

As for $q_2$, RM is chosen for $q_5$, $q_6$, and $q_7$, but for the former two, H60
is recognized as being irrelevant on different grounds: $q_5$ asks for film noirs, and so
H60, which contains horror movies, is eliminated by reasoning over genre information,
while $q_6$ involves movies from 1940, and thus H60, which contains only movies
produced in the 1960s, is excluded by reasoning over decades.

Finally, RP is chosen for $q_8$, as expected: a specific default source-selection rule
triggers for RP, which has precedence over generic default rules that would trigger
for other sources.

Table 2. Experimental Results for an Extended Selection Base

Query       q1   q2                     q3   q4   q5                q6                q7                     q8
Candidates  HC   RM, RMN, H60, HC, KG   KG   HC   RM, RMN, HC, KG   RM, RMN, HC, KG   RM, RMN, H60, HC, KG   RP
Best        HC   RMN                    KG   HC   RM, RMN           RM, RMN           RM                     RP



7.3.2 Results with modified selection bases 

In a slightly different scenario, RM is designed to have high coverage about composers
and western movies, too. In addition, a new random movie source, RandomMoviesNew
(RMN), is introduced; it is similar to RM, but has less coverage about genres, release
dates, composers, and western movies, while having high coverage about directors,
dramas, and comedies. Respective changes to the source descriptions, the
addition of a specific default source-selection rule for RMN (similar to rule $r_2$
for RM in Section 7.2.3), and corresponding user preferences in the source-selection
program yield an "extended" selection base, for which the results for $q_1$-$q_8$ are
shown in Table 2.

The change does not influence the results for $q_1$, $q_3$, $q_4$, and $q_8$. This is intuitive,
since the suitability of the chosen sources is unaffected. For the other queries, the
new source RMN is a further candidate, as the generic default source-selection rule
is also applicable to it. For a similar reason as before, RM and RMN are better than
the other candidates. For $q_2$, RMN is ranked above RM: it highly covers dramas,
which is an inferred query topic, since drama is a default genre for Marlene Dietrich
in the background knowledge. RM is ranked above RMN for $q_7$ since RM highly
covers composers, a concept occurring in the query (which asks for the composer
Elmer Bernstein). RM and RMN are ranked equal for $q_5$ and $q_6$: they highly cover
actors, but the background knowledge has no information about Laurel and Hardy;
and that Marilyn Monroe's default genre is comedy has no consequence for $q_5$, as
it explicitly asks for film noirs.

As a further modification, we considered a "reduced" selection base where RM is
down and, thus, cannot be queried. The results are given in Table 3. The candidate
sources remain the same, except that RM is missing; thus, the change has no impact
on $q_1$, $q_3$, $q_4$, and $q_8$, as one would expect. For $q_6$, the optimization part imposes no
preference between HC and KG; interestingly, it selects KG as being best for $q_2$,
$q_5$, and $q_7$. This is because the background knowledge entails information about the
productive period of Dietrich, Monroe, and Bernstein, who occur in the queries.
Thus, the 1950s and 1960s are inferred as being relevant topics for these queries,
and KG, covering both decades highly, outranks HC and H60, which highly cover
only one of the decades each.

Table 3. Experimental Results for a Reduced Selection Base

Query       q1   q2            q3   q4   q5       q6       q7            q8
Candidates  HC   H60, HC, KG   KG   HC   HC, KG   HC, KG   H60, HC, KG   RP
Best        HC   KG            KG   HC   KG       HC, KG   KG            RP



8 Related work 

The selection of data sources is a component in many information-integration systems
(cf., e.g., Arens et al. (1993), Bayardo et al. (1997), Garcia-Molina et al. (1997),
Singh et al. (1997), Genesereth et al. (1997), and Levy et al. (1996); see also
Levy and Weld (2000) and references therein). However, most center around mappings
between a global schema and local schemas, on query rewriting, and on query planning
to optimally reconstruct dispersed information. Our work, instead, is concerned with
qualitative selection from different alternatives, based on rich meta-knowledge and a
formal semantics respecting preference and context information involving heuristic
defaults, which is not an issue there. Furthermore, no form of query description
similar to the one in our method is considered in these approaches.

In the following subsection, we review some of the above-mentioned information-integration
systems in more detail. Afterwards, we discuss approaches bearing a
closer relation to our work.

8.1 Information integration systems

SIMS (Arens et al. 1993; Arens and Knoblock 1992; Arens et al. 1996), short for
Services and Information Management for Decision Systems, is a data integration
system which exploits a semantic model of a problem domain to integrate information
dispersed over various heterogeneous information sources. The latter are typically
databases, or, more generally, knowledge bases. The domain model is formulated
in the Loom knowledge-representation language (MacGregor and Bates 1987),
and comprises a declarative description of the objects and activities possible in the
specific domain. SIMS aims at providing the user with transparent access to the data,
without requiring awareness of the underlying heterogeneous data sources. It accepts user
queries in the form of a description of a class of objects about which information
is required. Any such query over the domain model is mapped to a query over the
information sources, by translating the concepts of the domain to corresponding
concepts in the data models of the information sources; if a direct translation does
not exist, a query rewriting is performed, and, if needed, multiple databases are accessed
in a query plan. SIMS strives for singling out optimal query plans, for which
aspects such as costs of accessing the different sources and combining the results
returned are taken into account. This is apparently different from the contributions
of our work, which is concerned with selecting a single information source among a
set of candidate sources. Furthermore, aspects of incomplete information and
nonmonotonic constructs to overcome it were not addressed in SIMS, nor was a method
similar to query description.

The Carnot project at MCC (Singh et al. 1997; Collet et al. 1991; Huhns and Singh 1992)
was an early effort to provide a logically unifying view of enterprise-wide, distributed,
and possibly heterogeneous data. The Carnot system has a layered architecture,
whose top layer consists of semantic services providing a suite of tools for
enterprise modeling, model integration, data cleaning, and knowledge discovery. The
Model Integration and Semantics Tool (MIST) is used for creating mappings between
local schemas and a common ontology expressed in Cyc (Lenat and Guha 1990)
or in a specific knowledge representation language, which is done once at the time
of integration. Besides relational databases, also knowledge-based systems (with 
an extensional part containing facts and an intensional part containing rules) may 
be integrated and, moreover, play a mediator role between applications and differ- 
ent databases. As an important feature, local database schemas remain untouched, 
and queries to them are translated to the global schema and back to (other) local 
schemas for data retrieval. Similar to SIMS, Carnot aims at providing a uniform 
and consistent view of heterogeneous data. A selection of information sources for 
query answering, based on similar criteria and methods as in our approach, is not 
evident. 

InfoSleuth (Bayardo et al. 1997; Fowler et al. 1999; Nodine et al. 2003), which
has its roots in Carnot, is an agent-based system for information discovery and re- 
trieval in a dynamic, open environment, broadening the focus of database research 
to the challenge of the World-Wide Web. It extends the capabilities of Carnot to an 
environment in which the identities of the information sources need not be known
at the time of generating the mapping. In this approach, agents are the constituents
of the system, whose knowledge and relationships to each other are described
in an InfoSleuth ontology. Decisions about user-query decompositions are based on
a domain ontology, which is selected by the user and describes knowledge about the 
relationships of the data stored in the sources that subscribe to the ontology. As 
for selection of information sources, special broker agents provide, upon request, 
information about which resource agents (i.e., information sources behind them) 
should be accessed for specific information sought. The broker performs a semantic 
matchmaking of the user request with the service descriptions of the provider agents 
(which may be viewed as an advanced yellow-pages service), aiming at ruling out, 
by means of constraints (e.g., over the range of values, existing attributes, etc.), 
all sources which will return a nil result. To this end, it must reason over explic- 
itly advertised information about agent capabilities to determine which agent can 
provide the requested services. The broker translates KIF statements into queries
in the LDL++ deductive database language, which are submitted to an LDL++
engine for evaluation. In this way, rule-based matching is facilitated. Our approach
differs significantly from InfoSleuth, and is in fact to some extent complementary to 
it. Indeed, the descriptions of constraints and other semantic criteria in InfoSleuth 
for selecting an information source are at a very low level. Even if the LDL++
language, which can emulate non-stratified negation via choice rules, is used for
rule-based matchmaking, there is no special support for dealing with contexts, user 
preferences, or optimization constructs as in our approach. Furthermore, it is not 
evident that InfoSleuth agents are programmed using a declarative language which 
provides similar functionalities for discriminating among different sources compliant
with the constraints. Instead, our formalism might be mapped to LDL++ by a
suitable transformation and thus provide a plug-in module for realizing semantically 
richer and refined brokering in InfoSleuth with a well-defined, formal semantics and 
provable properties. 

The Information Manifold (Kirk et al. 1995; Levy et al. 1995; Levy et al. 1996)
is a system for browsing and querying multiple networked information sources. 
Its architecture is based on a rich domain model which enables the description of 
properties of the information sources, such as their addresses, the protocols used 
to access them, their structure, etc., using a combination of the CLASSIC description
logic (Borgida et al. 1989), Horn rules, and integrity constraints. An external
information source is viewed as containing extensions of a collection of relations, on 
which integrity constraints may be imposed, and which are semantically mapped 
by rules to the relations in the global knowledge base. Information sources may be 
associated with topics, allowing to classify the former along a hierarchy of topics in 
the domain model. This mechanism can be used for deciding retrieval of a source 
for related queries. Like in SIMS, the user may pose queries in a high-level language 
on the global schema, which are mapped to queries over the local sources. The In- 
formation Manifold focuses on optimizing the execution of a user query, accessing 
as few information sources as necessary, where relevance is judged on criteria in- 
volving the (static) semantic mapping, and on combining the results. However, no 
qualitative selection similar to the one in our approach is made, and, in particular,
no user preferences or nonmonotonic rules (including default contexts) can be
expressed by constructs in the language.

Infomaster (Genesereth et al. 1997) provides integrated access to multiple distributed
heterogeneous information sources on the Internet, which gives the illusion
of a centralized, homogeneous information system in a virtual schema. The
system handles both structural and content translations to resolve differences be- 
tween multiple data sources and the multiple applications for the collected data, 
where mappings between the information sources and the global schema are de- 
scribed by rules and constraints. The user may pose queries on the virtual schema, 
which are first translated to queries over base relations at the information sources 
and then further rewritten to queries over site relations, which are views on the base 
relations, by applying logical abduction. The core of Infomaster is a facilitator that 
dynamically determines an efficient way to answer the user's query employing as few 
sources as necessary and harmonizes the heterogeneities among these sources. However,
as in the other information systems above, neither rich meta-data about the
quality of information sources is considered, nor are preferences or context information
used to heuristically discriminate between optional choices.

8.2 Other work 

More related to our approach than the methods in the previous subsection is the
work by Huffman and Steier (1995), which outlines an interactive tool for information
specialists in query design. It relieves them from searching through data-source
specifications and can suggest sources to determine trade-offs. However, neither a formal
semantics nor richer domain theories, capable of handling incomplete and default
information, are presented.



Remotely related to our work are the investigations by Fuhr (1999), presenting
a decision-theoretic model for selecting data sources based on retrieval cost and 
typical information-retrieval parameters. 



Goto et al. (2001) consider a problem setting related to ours, where source descriptions
include semantic knowledge about the source. In contrast to our work,
however, a query is viewed merely as a set of terms, and a source description is a 
thesaurus automatically constructed from the documents of the source. A further 
thesaurus, WordNet (Fellbaum 1998), is used for the source evaluation algorithm,
which is based on the calculation of weighted similarity measures. The main dif- 
ferences to our approach are that the selection method is not declarative and just 
numeric, semantic knowledge is limited to a thesaurus, and no further background 
knowledge, reasoning, or semantic query analysis is involved. 

Semantic analysis of queries has been incorporated into document retrieval by
Wendlandt and Driscoll (1991). Starting from conventional information-retrieval methods
that accept natural-language queries against text collections and calculate similarity
measures for query keywords, semantic modeling was introduced by trying
to detect entity attributes and thematic roles from the query to the effect of a 
modified similarity computation. While richer ontological knowledge than thesauri 
is used, source descriptions have no semantic knowledge. Again, the approach is not
declarative but numeric in nature, and neither rich domain theories nor automated
reasoning is involved.

FAQ FINDER (Burke et al. 1995) is a natural-language question-answering system
that uses files of frequently-asked questions (FAQ) as its knowledge base. It uses
standard information-retrieval methods to narrow the search to one FAQ file and 
to calculate a term-vector metric for the user's question and question/answer pairs. 
Moreover, it uses a comparison of question types in a taxonomy derived from the 
query, and a semantic similarity score in question matching. The latter is calculated
by traversing the hypernym links (i.e., is-a links) of WordNet.

Recent proposals for Web-based information retrieval built on ontology-based
agents which search for, maintain, and mediate relevant information for a user
or other agents are discussed by Luke et al. (1997), Sim and Wong (2001), and
Chen and Soo (2001). More specifically, Sim and Wong (2001) describe a society
of software agents where query-processing agents assist users in selecting Web
pages. They search for URLs using search engines and ontological WordNet relations
for query specialization or generalization to keep the number of located,
relevant URLs within given limits. An architecture for ontology-based information-gathering
agents appears also in the work of Chen and Soo (2001), but here special
domain search engines and Web documents are used as well. Ontologies are rep- 
resented in a usual object-oriented language, and queries are partial instances of 
ontological concepts. 

9 Conclusion 

In this paper, we have presented a knowledge-based approach for information-source 
selection, using meta-knowledge about the quality of the sources for determining 
a "best" information source to answer a given query, which is posed in a formal 
query language (as considered here, XML-QL). We have described a rule-based 
language for expressing source-selection policies in a fully declarative way, which 
supports reasoning tasks that involve different components such as background and 
ontological knowledge, source descriptions, and query constituents. Furthermore, 
the language provides a number of features which have proven valuable in the 
context of knowledge representation, viz. the capability of dealing with incomplete 
information, default rules, and preference information. 

We have developed a novel method for automated query analysis at a generic level 
in which interesting information is distilled from a given query expression in a formal 
query language, as well as an approach to preference handling in source selection, 
which combines implicit rule priorities, given by the context of rule applications, and 
explicit user preferences. As pointed out previously, context-based rule application
is a different concept from inheritance-based reasoning; to the best of our knowledge,
no similar approach for handling default-context rules has been considered before.
We presented a formal model-theoretic semantics of our approach, which is based 
on the answer-set semantics of extended logic programs. Furthermore, we analyzed 
semantical and computational properties of our approach, where we showed that 
source-selection programs possess desirable properties which intuitively should be
satisfied. We emphasize that for other, related approaches no similar results are
evident, since the lack of a formal semantics makes it harder to reason about
their behavior.

The results that we have obtained in the implementation of the experimental 
movie application are encouraging, and suggest several directions for further work. 
One issue concerns the supply of rich background and common-sense knowledge. 
The coupling with available ontology and common-sense engines via suitable interfaces
suggests itself for this purpose. Extensions of logic programs under the answer-set
semantics allowing such a coupling have been realized, e.g., by Eiter et al.
(2004; 2005b; 2005a). Also, other recent efforts aim at mapping description logics
underlying different ontology languages to logic programs (Grosof et al. 2003;
Motik et al. 2003; Swift 2004).

Another direction for further work involves the application of our results in the 
context of information integration and query systems. They might be valuable for 
enriching semantic brokering in open agent-based systems, but also for more tra- 
ditional closed systems in which information sources must be manually registered. 
In particular, the advanced information-integration methods, employing extended
logic programming tools, developed within the INFOMIX project are natural candidates
for incorporating a heuristic source-selection component. ^^

Our results are also relevant for adaptive source selection which is customized, 
e.g., by user profiles. This subject is important for realizing personalized information 
systems in a dynamic environment, which, to a large extent, involve user preferences 
and reasoning with incomplete information and defaults, as well as dynamic updates 
of source descriptions.

Appendix A The XML DTD for the movie databases

<!ELEMENT MovieDB (Movie | Actor | Director | Screenwriter |
                   Composer | Person | Award | Filmfestival)*>
<!ELEMENT Movie (Title, AlternativeTitle*, ReleaseDate?,
                 RunningTime?, Culture?, LeadingRole*, Role*, Actor*,
                 Director*, Screenwriter*, Soundtrack*, Review*, Award*)>
<!ATTLIST Movie
          Genre (Action | Animation | Classic | Comedy | CowboyWestern |
                 CultMovie | Documentary | Experimental | FilmNoir |
                 Horror | Romance | SciFiFantasy | Series | Silent | Travel | Other)
                #IMPLIED
          Language CDATA "English">
<!ELEMENT Person (FirstName*, LastName, BirthDate?, Country?, Biography?)>
<!ATTLIST Person ID ID #REQUIRED Gender (male | female) #IMPLIED>
<!ELEMENT Award (AwardTitle, Date, AwardType?, AwardCategory?)>
<!ELEMENT Character (#PCDATA)>
<!ELEMENT Filmfestival (#PCDATA)>
<!ELEMENT Actor (Award*)>
<!ATTLIST Actor Personalia IDREF #REQUIRED>
<!ELEMENT Director (Award*)>
<!ATTLIST Director Personalia IDREF #REQUIRED>
<!ELEMENT Screenwriter (Award*)>
<!ATTLIST Screenwriter Personalia IDREF #REQUIRED>
<!ELEMENT Composer (Award*)>
<!ATTLIST Composer Personalia IDREF #REQUIRED>
<!ELEMENT Soundtrack (Title, Composer*, Award*)>
<!ELEMENT Biography (#PCDATA)>
<!ELEMENT AlternativeTitle (#PCDATA)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
<!ELEMENT BirthDate (Date)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Country (#PCDATA)>
<!ELEMENT AwardCategory (#PCDATA)>
<!ELEMENT AwardType (#PCDATA)>
<!ELEMENT AwardTitle (#PCDATA)>
<!ELEMENT ReleaseDate (Date)>
<!ELEMENT RunningTime (#PCDATA)>
<!ELEMENT LeadingRole (Character, Award*)>
<!ATTLIST LeadingRole Actor IDREF #REQUIRED>
<!ELEMENT Role (Character, Award*)>
<!ATTLIST Role Actor IDREF #REQUIRED>
<!ELEMENT Review (ReviewText, Rating?)>
<!ELEMENT ReviewText (#PCDATA)>
<!ELEMENT Rating (#PCDATA)>
<!ELEMENT Culture (#PCDATA)>

^^ See http://sv.mat.unical.it/infomix/ for details about INFOMIX.



Appendix B Query description 

In what follows, we provide details about the query description predicates and the 
query-analysis program.

B.1 Low-level predicates

The query and its syntactic subqueries are named by constants (e.g., q1, q2, ...).
The facts $R(Q)$ are formed using the following predicates:

• sub_query(Q', Q): Q' is a structural subquery of query Q (possibly itself a
subquery);

• query_cand(Q): identifies the overall query;

• source(S, Q): query Q accesses source S;

• db_name(S): source S is a database;

• whereRef(O, T, P, Q): an IRP O references an item under element T and
remaining path P in the where part of query Q;

• subpath(O, T1, P1, T2, P2): the path T1/P1 is a direct subpath of T2/P2 in the
IRP O;

• whereRefCmp(O1, R, O2): the items of IRPs O1 and O2 are compared using
operator R;

• whereCmp(O, R, V): the item of IRP O is compared to value V using operator R;

• consRef(O, T, P): the item of IRP O is constructed under element T and
remaining path P in the (answer) construction part of query Q.

$R(Q)$ must respect that query languages may allow for nested queries. However,
in a query expression, an outermost query as the "root" of nesting should be identifiable,
as well as structural (syntactic) subqueries of it. They are described using
query_cand and sub_query, respectively.

Along an IRP, item references relative to a position are captured by the whereRef
predicate, and suffix inclusions for this IRP are stored as subpath facts. The predicates
whereRefCmp and whereCmp mirror the comparison of two items and the
comparison of an item with a value, respectively. Items that occur in the construction
part of a query are also identified by an IRP and stored using consRef.
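
The way whereRef and subpath facts unfold along one IRP can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name irp_facts and the encoding of facts as tuples are hypothetical and not part of the described system; the path and constant names are those of the third IRP in Example 11.

```python
def irp_facts(irp, elements, query):
    """Return (whereRef facts, subpath facts) for one item reference path.

    `elements` lists the element names from the document root down to the
    referenced item; facts are encoded as plain tuples (an assumption made
    for this sketch only).
    """
    # Each position yields a pair (element, remaining path below that element).
    suffixes = [(elements[i], "/".join(elements[i + 1:]))
                for i in range(len(elements))]
    # One whereRef fact per suffix, innermost element first.
    where_ref = [("whereRef", irp, t, p, query) for t, p in reversed(suffixes)]
    # One subpath fact per pair of adjacent suffixes: T1/P1 is the direct
    # subpath of T2/P2, where T2 is the element directly above T1.
    subpath = [("subpath", irp, t1, p1, t2, p2)
               for (t2, p2), (t1, p1) in zip(suffixes, suffixes[1:])]
    subpath.reverse()
    return where_ref, subpath


where_ref, subpath = irp_facts(
    "o3", ["MovieDB", "Movie", "Director", "Personalia", "LastName"], "q2")
for fact in where_ref + subpath:
    print(fact)
```

A path with five elements yields five whereRef facts and four subpath facts, which matches the number of facts listed for o3 in Example 11.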

Example 11 

The low-level representation $R(Q)$ of the query in Example ^ contains

sub_query(q2, q1), query_cand(q1), source("MovieDB", q2), and
db_name("MovieDB"),

and, e.g., for the third IRP, o3, which references "LastName", the facts:

whereRef(o3, "LastName", "", q2);
whereRef(o3, "Personalia", "LastName", q2);
whereRef(o3, "Director", "Personalia/LastName", q2);
whereRef(o3, "Movie", "Director/Personalia/LastName", q2);
whereRef(o3, "MovieDB", "Movie/Director/Personalia/LastName", q2);
subpath(o3, "LastName", "", "Personalia", "LastName");
subpath(o3, "Personalia", "LastName", "Director", "Personalia/LastName");
subpath(o3, "Director", "Personalia/LastName", "Movie",
        "Director/Personalia/LastName");
subpath(o3, "Movie", "Director/Personalia/LastName", "MovieDB",
        "Movie/Director/Personalia/LastName");
whereCmp(o3, equal, "Hitchcock").

The complete low-level representation $R(Q)$ of the query is given in Appendix C
(cf. also Eiter et al. (2003) or Fink (2002)).



B.2 High-level predicates 

The following high-level description predicates are defined: 

• query(Q): identifies an "independent" (sub-)query Q (i.e., Q is executable on
some source), which is, moreover, not a purely syntactic subquery (i.e., which
is not embraced by a sourceless query Q' merely restructuring the result of
Q; for details, cf. the explanation of rules qa8-qa12 of $\Pi_{qa}$ below);

• cref(O, C, P, Q): states that (C, P) is a CRP for Q via IRP O in the where-part
of Q;

• occurs(O, V): the value V is associated with an IRP O in the overall query;

• selects(O, R, V): like occurs, but details the association with a comparison
operator R;

• constructs(O, I, P): states that the item of IRP O, by use of a variable, also
appears in the construct-part of the global query, as an item I under path P
(which may be different from the path in the where-part);

• joins(O1, O2, R): records (theta-)joins of (or within) queries between IRPs O1
and O2 under comparison operator R.

Example 12 

For the query in Example ^ we have query(q1) but not query(q2), since the embracing
query q1 has no source and merely structures the result of q2. The following
cref facts result from o1 and o3:

cref(o1, "MovieDB", "Movie/Title", q1);
cref(o1, "Movie", "Title", q1);
cref(o3, "MovieDB", "Movie/Director/Personalia/LastName", q1);
cref(o3, "Movie", "Director/Personalia/LastName", q1);
cref(o3, "Director", "Personalia/LastName", q1);
cref(o3, "Person", "LastName", q1).

Here, "MovieDB", "Movie", "Director", and "Person" are concepts given by the
ontology, and "Personalia" is known to be a synonym of "Person" (cf. Appendix B.3
for further discussion).

The fact occurs(o3, "Hitchcock") states that value Hitchcock is associated with o3.
This is detailed by selects(o3, equal, "Hitchcock"), where equal represents equality.
For the constructs predicate, the fact constructs(o1, "Movie", "") is included. There
are no joins facts since the query has no join. The complete high-level description
is given in Appendix C (cf. also Eiter et al. (2003) or Fink (2002)).

B.3 Query-analysis program

The query-analysis program $\Pi_{qa}$ is composed of the following groups of rules. The
first rules enlarge the low-level predicate subpath as follows: ^^

qa1: subpath(O, T1, P1, T3, P3) ← subpath(O, T1, P1, T2, P2),
                                  subpath(O, T2, P2, T3, P3);
qa2: subpath(O, T, P1, T2, P2) ← subpath(O, L, P1, T2, P2), synonym(L, T);
qa3: subpath(O, T1, P1, T, P2) ← subpath(O, T1, P1, L, P2), synonym(L, T).

Rule qa1 expresses transitivity for elements occurring in paths, and qa2 and qa3
deal with synonyms, which are imported ontological knowledge; synonym applies to
all pairs of synonymous element names (e.g., names of IDREF attributes ^^).

^^ In a clean separation of $R(Q)$ and the high-level description, a fresh predicate would be in
order here. However, it is convenient and economical to re-use the predicate subpath, as it is only
enlarged.

^^ If in a DTD an attribute is declared of type IDREF, this means that its value is the identifier
of another element.

The following two rules define useful projections of low-level predicates:

qa4: has_source(Q) ← source(_, Q);
qa5: is_sub_query(Q) ← sub_query(Q, _).

Using them, an auxiliary predicate iquery_cand is defined for candidates which
may satisfy the query predicate; these are the overall query and subqueries having
a database or a document as their source:

qa6: iquery_cand(Q) ← query_cand(Q);
qa7: iquery_cand(Q) ← is_sub_query(Q), source(Z, Q), db_name(Z).

Concerning the high-level predicates, independent, separate queries are specified
by respecting the nesting structure:

qa8:  query(Q) ← top_query(Q, Q);
qa9:  top_query(Q, Q) ← iquery_cand(Q), not is_sub_query(Q);
qa10: top_query(Q, Q) ← iquery_cand(Q), sub_query(Q, S), source(Z, S);
qa11: top_query(S, Q) ← sub_query(S, Z), iquery_cand(S), top_query(Z, Q),
                        not has_source(Z);
qa12: top_query(S, Q) ← sub_query(S, Z), not iquery_cand(S),
                        top_query(Z, Q).

Rule qa8 expresses the property that a query is considered to be independent
if it is the topmost independent query of itself. This is the case if the query is a
candidate for a separate query and it is either the outermost query (dealt with
by Rule qa9) or a direct structural subquery of a query to a source (expressed by
Rule qa10). Moreover, qa10 intuitively states that a candidate query nested within
another query is viewed as a separate query only if the nesting was not for purely
syntactic reasons, i.e., it has its own source. In case of a purely syntactic subquery,
or if a nested query is not a candidate for a separate query, its topmost independent
query is the one of the embracing query, as taken care of by qa11 and qa12, respectively.

The next rules define the remaining high-level description predicates. The auxiliary
predicate has_constructs guarantees that at least one constructs fact is generated
for each context reference constructed in the query answer.

qa13: cref(O, T, P, Q) ← whereRef(O, T, P, S), class(T),
                         top_query(S, Q);
qa14: cref(O, T, P, Q) ← whereRef(O, L, P, S), synonym(L, T),
                         class(T), top_query(S, Q);
qa15: constructs(O, T, P) ← consRef(O, T, P), class(T);
qa16: constructs(O, T, P) ← consRef(O, L, P), synonym(L, T), class(T);
qa17: has_constructs(O) ← consRef(O, T, P), class(T);
qa18: has_constructs(O) ← consRef(O, L, P), synonym(L, T), class(T);
qa19: constructs(O, "", "") ← consRef(O, _, _), not has_constructs(O);
qa20: occurs(O, V) ← whereCmp(O, C, V);
qa21: selects(O, C, V) ← whereCmp(O, C, V);
qa22: joins(O1, O2, C) ← whereRefCmp(O1, C, O2).

Note that some rules reference the ontology predicate class. A fact class(e) should
exist in (or be entailed by) the domain ontology for all elements e that are
considered to be concepts.

When queries are joined over CRPs, then some of the occurrence, selection, and
construction information of one CRP is also valid for the other. Hence, we can build
a form of a closure over joined CRPs, which is expressed by the following rules:

qa23: constructs(O1, T, P) ← joins(O1, O2, equal), constructs(O2, T, P);
qa24: constructs(O2, T, P) ← joins(O1, O2, equal), constructs(O1, T, P);
qa25: occurs(O1, V) ← joins(O1, O2, C), occurs(O2, V);
qa26: occurs(O2, V) ← joins(O1, O2, C), occurs(O1, V);
qa27: selects(O1, C, V) ← joins(O1, O2, equal), selects(O2, C, V);
qa28: selects(O2, C, V) ← joins(O1, O2, equal), selects(O1, C, V);
qa29: selects(O1, notequal, V) ← joins(O1, O2, notequal), selects(O2, equal, V);
qa30: selects(O2, notequal, V) ← joins(O1, O2, notequal), selects(O1, equal, V).

We remark that, as easily seen, the rules of $\Pi_{qa}$ form a locally stratified logic
program, and thus $\mathit{Ont} \cup \Pi_{qa} \cup R(Q)$ has a unique answer set.

Example 13

Let us consider how the high-level fact cref(o3, "Person", "LastName", q1) is derived
in $\Pi_{qa}$, given $R(Q)$ of the query in Example ^.

Since query_cand(q1) is in $R(Q)$, we obtain, by qa6, iquery_cand(q1). Since the
fact is_sub_query(q1) is not derivable, qa9 yields top_query(q1, q1) (i.e., stating that
q1 is independent). Next, we can derive is_sub_query(q2) by means of qa5, and thus
iquery_cand(q2) in view of qa7, given that $R(Q)$ includes the facts sub_query(q2, q1),
source("MovieDB", q2), and db_name("MovieDB"). Since q1 has no source (i.e.,
has_source(q1) is not derivable), we can derive top_query(q2, q1) from qa11. The fact
cref(o3, "Person", "LastName", q1) is now derived by means of qa14, making use
of whereRef(o3, "Personalia", "LastName", q2) from $R(Q)$, together with the facts
synonym("Personalia", "Person") and class("Person") from the ontology, and the
derived fact top_query(q2, q1). Note that cref(o3, "Personalia", "LastName", q1) is
not derivable, as class("Personalia") $\notin \mathit{Ont}$.
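
The derivation in Example 13 can be retraced with a small bottom-up evaluation. The following Python sketch is only an illustration, not the DLV-based implementation: it hard-codes the rules for has_source, is_sub_query, iquery_cand, top_query, query, and cref as described above, evaluates the negated predicates before the rules that depend on them, and encodes facts as tuples. The function name analyze and the tuple encoding are assumptions made here; the ontology content (four concepts and one synonym) is taken from Examples 12 and 13.

```python
def analyze(facts, synonym, classes):
    """Evaluate the top_query/query/cref rules bottom-up (hypothetical helper)."""
    rel = lambda name: {f[1:] for f in facts if f[0] == name}
    sub_query, source, where_ref = rel("sub_query"), rel("source"), rel("whereRef")
    db_name = {d for (d,) in rel("db_name")}
    query_cand = {q for (q,) in rel("query_cand")}

    has_source = {q for _, q in source}                 # has_source(Q)
    is_sub_query = {q for q, _ in sub_query}            # is_sub_query(Q)
    iquery_cand = set(query_cand)                       # overall query
    iquery_cand |= {q for z, q in source                # subquery on a database
                    if q in is_sub_query and z in db_name}

    # Non-recursive top_query rules: outermost candidate, or candidate that is
    # a direct subquery of a query with a source.
    top_query = {(q, q) for q in iquery_cand if q not in is_sub_query}
    top_query |= {(q, q) for q, s in sub_query
                  if q in iquery_cand and s in has_source}
    # Recursive top_query rules: inherit the top query of the embracing query.
    while True:
        new = set()
        for s, z in sub_query:
            for z2, q in top_query:
                if z2 != z:
                    continue
                if s in iquery_cand and z not in has_source:
                    new.add((s, q))
                if s not in iquery_cand:
                    new.add((s, q))
        if new <= top_query:
            break
        top_query |= new

    query = {s for s, q in top_query if s == q}
    cref = set()
    for o, t, p, s in where_ref:
        for s2, q in top_query:
            if s2 != s:
                continue
            if t in classes:                            # element is a concept
                cref.add((o, t, p, q))
            for l, t2 in synonym:                       # element has a concept synonym
                if l == t and t2 in classes:
                    cref.add((o, t2, p, q))
    return query, top_query, cref


facts = {
    ("sub_query", "q2", "q1"), ("query_cand", "q1"),
    ("source", "MovieDB", "q2"), ("db_name", "MovieDB"),
    ("whereRef", "o3", "LastName", "", "q2"),
    ("whereRef", "o3", "Personalia", "LastName", "q2"),
    ("whereRef", "o3", "Director", "Personalia/LastName", "q2"),
    ("whereRef", "o3", "Movie", "Director/Personalia/LastName", "q2"),
    ("whereRef", "o3", "MovieDB", "Movie/Director/Personalia/LastName", "q2"),
}
synonym = {("Personalia", "Person")}
classes = {"MovieDB", "Movie", "Director", "Person"}

query, top_query, cref = analyze(facts, synonym, classes)
print(sorted(query), sorted(top_query))
print(sorted(cref))
```

Because negation is applied only to predicates that are fully computed beforehand, this staged evaluation agrees with the unique answer set on the predicates it covers.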

Appendix C Query-representation for Example ^ 

The low-level representation $R(Q)$ of the query in Example ^ comprises the following
facts:

db_name("MovieDB");
query_cand(q1);
sub_query(q2, q1);
source("MovieDB", q2);
whereRef(o1, "Title", "", q2);
whereRef(o1, "Movie", "Title", q2);
whereRef(o1, "MovieDB", "Movie/Title", q2);
subpath(o1, "Title", "", "Movie", "Title");
subpath(o1, "Movie", "Title", "MovieDB", "Movie/Title");
whereRef(o2, "FirstName", "", q2);
whereRef(o2, "Personalia", "FirstName", q2);
whereRef(o2, "Director", "Personalia/FirstName", q2);
whereRef(o2, "Movie", "Director/Personalia/FirstName", q2);
whereRef(o2, "MovieDB", "Movie/Director/Personalia/FirstName", q2);
subpath(o2, "FirstName", "", "Personalia", "FirstName");
subpath(o2, "Personalia", "FirstName", "Director", "Personalia/FirstName");
subpath(o2, "Director", "Personalia/FirstName", "Movie",
        "Director/Personalia/FirstName");
subpath(o2, "Movie", "Director/Personalia/FirstName", "MovieDB",
        "Movie/Director/Personalia/FirstName");
whereRef(o3, "LastName", "", q2);
whereRef(o3, "Personalia", "LastName", q2);
whereRef(o3, "Director", "Personalia/LastName", q2);
whereRef(o3, "Movie", "Director/Personalia/LastName", q2);
whereRef(o3, "MovieDB", "Movie/Director/Personalia/LastName", q2);
subpath(o3, "LastName", "", "Personalia", "LastName");
subpath(o3, "Personalia", "LastName", "Director", "Personalia/LastName");
subpath(o3, "Director", "Personalia/LastName", "Movie",
        "Director/Personalia/LastName");
subpath(o3, "Movie", "Director/Personalia/LastName", "MovieDB",
        "Movie/Director/Personalia/LastName");
whereCmp(o2, equal, "Alfred");
whereCmp(o3, equal, "Hitchcock");
consRef(o1, "Movie", "");
consRef(o1, "MovieList", "Movie").

The high-level description, except for auxiliary predicates and the completion of
the subpath predicates, is given by the following facts:

query(q1);
cref(o1, "MovieDB", "Movie/Title", q1);
cref(o1, "Movie", "Title", q1);
cref(o2, "MovieDB", "Movie/Director/Personalia/FirstName", q1);
cref(o2, "Movie", "Director/Personalia/FirstName", q1);
cref(o2, "Director", "Personalia/FirstName", q1);
cref(o2, "Person", "FirstName", q1);
cref(o3, "MovieDB", "Movie/Director/Personalia/LastName", q1);
cref(o3, "Movie", "Director/Personalia/LastName", q1);
cref(o3, "Director", "Personalia/LastName", q1);
cref(o3, "Person", "LastName", q1);
occurs(o2, "Alfred"), occurs(o3, "Hitchcock");
selects(o2, equal, "Alfred");
selects(o3, equal, "Hitchcock");
constructs(o1, "Movie", "").

Appendix D Further properties of source-selection programs 

In order to realize the construction of $\mathcal{E}(\mathcal{S}, Q)$ in terms of a single logic program, we
introduce a set $N$ of constants serving as names for rules, and a new binary predicate
$pref(\cdot, \cdot)$, defined over $N$, expressing preference between rules. We refer to
$A_{sel} \cup \{pref(\cdot, \cdot)\} \cup N$ as the extended vocabulary. We furthermore assume an
injective function $n(\cdot)$ which assigns to each rule $r \in \Pi_Q$ a name $n(r) \in N$. To
ease notation, we also write $n_r$ instead of $n(r)$. Finally, $Lit_{pref}$ denotes the set of
all literals having predicate symbol $pref$. Note that $Lit_{pref} \cap Lit_{sel} = \emptyset$.

Theorem 4 

Let $\mathcal{S} = (\Pi_{qa}, \Pi_{dom}, \Pi_{sd}, \Pi_{sel}, <_u)$ be a selection base, $Q$ a query, and
$\mathcal{E}(\mathcal{S}, Q) = (\Pi_Q, <)$. Furthermore, let
$\Pi_{\mathcal{S}}(Q) = \Pi_{qa} \cup R(Q) \cup \Pi_{dom} \cup \Pi_{sd} \cup \Pi_Q$. Then, there
exists a logic program $\Pi_{obj}(Q)$ over a vocabulary $A \supseteq A_{sel} \cup \{pref(\cdot, \cdot)\} \cup N$
such that every answer set $X$ of $\Pi_{\mathcal{S}}(Q) \cup \Pi_{obj}(Q)$ satisfies the following conditions:

1. $X \cap Lit_{pref}$ represents $<$, i.e., $pref(n_r, n_{r'}) \in X$ iff $r < r'$; and

2. $X \cap Lit_{sel}$ is a selection answer set of $(\Pi_{sel}, <_u)$ for $Q$ with respect to $\mathcal{S}$ iff $X$
is a preferred answer set of the prioritized program $(\Pi_{\mathcal{S}}(Q) \cup \Pi_{obj}(Q), <)$.

Proof

We give a description of $\Pi_{obj}(Q)$ but omit a detailed argument that it satisfies
the desired properties. Informally, $\Pi_{obj}(Q)$ consists of two parts: the first, $\Pi_{rel}$, is
derived from $\Pi_{\mathcal{S}}(Q)$ and takes care of computing the relevant rules for $Q$, by utilizing
weak constraints; the second part, $\Pi_{pref}$, is a (locally) stratified logic program
determining the relations of Definition |S|. We start with the construction of $\Pi_{rel}$.

For each predicate $p \in A_{sel}$ and each $r \in \Pi_Q$, we introduce a new predicate
$p_r$ of the same arity as $p$. In addition, we introduce a new atom $rel_r$, informally
expressing that rule $r$ is relevant. If $\alpha$ is either a literal, a set of literals, a rule,
or a program, then by $[\alpha]_r$ we denote the result of uniformly replacing each atom
$p(x_1, \ldots, x_n)$ occurring in $\alpha$ by $p_r(x_1, \ldots, x_n)$.

For each $r \in \Pi_Q$, we define a program $\Pi_r$ containing the following items:

1. each rule in $[\Pi_{qa} \cup R(Q) \cup \Pi_{sd} \cup \Pi_{dom}]_r$;

2. the rule $rel_r \leftarrow [B(r)]_r$; and

3. the extended default-context rules

$\mathit{default\_class}_r(O, C, Q) \leftarrow \mathit{cref}_r(O, C, \_, Q)$,

$\mathit{default\_path}_r(O, P, Q) \leftarrow \mathit{cref}_r(O, \_, P, Q)$.

As is easily checked, $r$ is relevant for $Q$ iff $\Pi_r$ has some answer set containing $rel_r$.
Now, $\Pi_{rel}$ is defined as the collection of all programs $\Pi_r$, together with
weak constraints of the form

$\Leftarrow not\ rel_r \; [1 : m+1]$, (D1)

for every $r \in \Pi_Q$, where $m$ is the maximal priority level of the weak constraints
occurring in 11°^;. Since, for any $r_1, r_2 \in \Pi_Q$ with $r_1 \neq r_2$, the programs $\Pi_{r_1}$ and
$\Pi_{r_2}$ are defined over disjoint vocabularies, and given the inclusion of the weak
constraints (D1) in $\Pi_{rel}$, we obtain that $\Pi_{rel}$ satisfies the following property:

(*) for every answer set $X$ of $\Pi_{rel}$ and every $r \in \Pi_Q$, $r$ is relevant for $Q$ iff $rel_r \in X$.

These answer sets are used as inputs for the program $\Pi_{pref}$, which is defined next.
Let $pr(n, m)$ and $pr'(n, m)$ be new binary predicates, where $n, m$ are names.
Then, $\Pi_{pref}$ consists of the following rules:

1. $pr(n_{r_1}, n_{r_2}) \leftarrow rel_{r_1}, rel_{r_2}$, for every $r_1, r_2 \in \Pi_Q$ such that either $r_1 <_u r_2$ or
$r_1, r_2$ satisfy Conditions $(O_1)$ or $(O_2)$ of Definition |H|;

2. $pr(n_{r_1}, n_{r_2}) \leftarrow rel_{r_1}, rel_{r_2}, subpath(o, t_1, p_1, t_2, p_2)$, for every $r_1, r_2 \in \Pi_Q$ such
that $cref(o, t_1, p_1, q) \in B(r_1)$ and $cref(o, t_2, p_2, q) \in B(r_2)$; and

3. the rules

$pr'(N_1, N_2) \leftarrow pr(N_1, N_2)$,

$pr'(N_1, N_3) \leftarrow pr'(N_1, N_2), pr(N_2, N_3)$,

$pref(N_1, N_2) \leftarrow pr(N_1, N_2), not\ pr'(N_2, N_1)$,

$pref(N_1, N_3) \leftarrow pref(N_1, N_2), pref(N_2, N_3)$.

Obviously, $\Pi_{pref}$ is a (locally) stratified program. Moreover, in view of Condition (*),
and since $\Pi_{rel}$ is independent of $\Pi_{pref}$ and $\Pi_{\mathcal{S}}(Q)$ is independent of
$\Pi_{obj}(Q)$, for every answer set $X$ of $\Pi_{\mathcal{S}}(Q) \cup \Pi_{obj}(Q) = \Pi_{\mathcal{S}}(Q) \cup \Pi_{rel} \cup \Pi_{pref}$, we
have that (i) $pr(n_{r_1}, n_{r_2}) \in X$ iff $r_1 < r_2$, (ii) $pr'(n_{r_1}, n_{r_2}) \in X$ iff $r_1 <^{*} r_2$, and
(iii) $pref(n_{r_1}, n_{r_2}) \in X$ iff $r_1 < r_2$. This proves Condition 1 of the theorem.
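
Since the four rules of item 3 form a stratified program, they can be evaluated bottom-up. The following Python sketch mirrors that evaluation under the assumption that the ground facts for $pr$ (derived by the rules of items 1 and 2 for the relevant rules) are already given as a set of name pairs; the function names are hypothetical.

```python
def transitive_closure(pairs):
    """Least transitive relation containing the given set of pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def preference_relation(pr):
    """Evaluate the rules of item 3 stratum by stratum.

    pr'  is the transitive closure of pr (first two rules);
    pref keeps a pair of pr only if the reverse pair is not in pr'
         (third rule) and is then closed under transitivity (fourth rule).
    """
    pr_closure = transitive_closure(pr)
    base = {(a, b) for (a, b) in pr if (b, a) not in pr_closure}
    return transitive_closure(base)

# Rules r1 and r2 lie on a cycle of pr, so neither is preferred over the other.
pr = {("r1", "r2"), ("r2", "r1"), ("r2", "r3"), ("r3", "r4")}
pref = preference_relation(pr)
```

In this example, the pairs on the cycle between `r1` and `r2` are discarded, while the remaining pairs and their transitive consequences are kept.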

As for Condition 2, consider some answer set $X$ of $\Pi_{\mathcal{S}}(Q) \cup \Pi_{obj}(Q)$. Since $\Pi_{\mathcal{S}}(Q)$
is independent of $\Pi_{obj}(Q)$, $X$ is of form $Y \cup Y'$, where $Y$ is an answer set of $\Pi_{\mathcal{S}}(Q)$
and $Y'$ is a set of ground literals disjoint from $Lit_{sel}$. Hence, $X \cap Lit_{sel} = Y$.
According to Theorem^, $Y$ is a selection answer set of $(\Pi_{sel}, <_u)$ for $Q$ with respect
to $\mathcal{S}$ iff $Y$ is a preferred answer set of $(\Pi_{\mathcal{S}}(Q), <)$. But it is easily seen that the
latter holds just in case $Y \cup Y'$ is a preferred answer set of $(\Pi_{\mathcal{S}}(Q) \cup \Pi_{obj}(Q), <)$. This
proves the result. $\Box$

We note the following comments. First, $\Pi_{rel}$ can be simplified by taking independence
of subprograms of $\Pi_{\mathcal{S}}(Q)$ and possible uniqueness of answer sets for them
into account. For example, if $\Pi_{sd}$ has a unique answer set, then we may use in each
program $\Pi_r$ simply $[\alpha]_r = \alpha$, for each literal $\alpha$ over $\mathcal{A}_{sd}$. In particular, if the program
$\Pi_{qa} \cup R(Q) \cup \Pi_{dom} \cup \Pi_{sd}$ has a unique answer set (e.g., if this program is locally
stratified), then we may simply take as $\Pi_{rel}$ the program $\Pi_{qa} \cup R(Q) \cup \Pi_{dom} \cup \Pi_{sd}$
together with all rules $rel_r \leftarrow B^{+}(r)$, for $r \in \Pi_Q$.

Second, the program $\Pi_{\mathcal{S}}(Q) \cup \Pi_{obj}(Q)$ in Theorem 4 represents, via preferred
answer sets for a dynamic rule preference given by the atoms over $pref$, the selection
answer sets of $(\Pi_{sel}, <_u)$ for $Q$. It can be easily adapted to a fixed program $\Pi_{\mathcal{S}}$ such
that, for any query $Q$, the dynamic preferred answer sets of $\Pi_{\mathcal{S}} \cup R(Q)$ represent
the selection answer sets of $(\Pi_{sel}, <_u)$ for $Q$ (cf. Delgrande et al. (2003) for more
details on dynamic preferences).

As for the complexity of source-selection programs, we can derive the following
result as a consequence of Theorem 4.

Theorem 5 

Given a query $Q$ and the grounding of $\Pi_{\mathcal{S}}(Q) = \Pi_{qa} \cup R(Q) \cup \Pi_{dom} \cup \Pi_{sd} \cup \Pi_Q$,
for a selection base $\mathcal{S} = (\Pi_{qa}, \Pi_{dom}, \Pi_{sd}, \Pi_{sel}, <_u)$, deciding whether $(\Pi_{sel}, <_u)$ has
some selection answer set for $Q$ with respect to $\mathcal{S}$ is NP-complete. Furthermore,
computing any such selection answer set is complete for $FP^{NP}$.

Proof 

Obviously, the groundings of the programs $\Pi_{rel}$ and $\Pi_{pref}$ in the proof of Theorem 4
are constructible in polynomial time from $Q$ and the grounding of $\Pi_{\mathcal{S}}(Q)$, and so
is the ground program $\Pi'$, consisting of the groundings of $\Pi_{\mathcal{S}}(Q)$, $\Pi_{rel}$, and $\Pi_{pref}$.
Furthermore, the preferred answer sets of $(\Pi', <)$ correspond to the selection answer
sets of $(\Pi_{sel}, <_u)$. Since deciding whether a prioritized logic program (with no weak
constraints) has a preferred answer set is NP-complete (Delgrande et al. 2003), it
follows that deciding whether $(\Pi_{sel}, <_u)$ has a selection answer set for $Q$ with
respect to $\mathcal{S}$ is in NP. Note that the presence of weak constraints has no influence
on the worst-case complexity of deciding the existence of (preferred) answer sets.
Moreover, NP-hardness is immediate since the auxiliary rules can form any standard
logic program.

From any answer set $X$ of $\Pi'$, an answer set of $(\Pi_{sel}, <_u)$ is easily computed.
Computing such an $X$ is feasible in polynomial time with an NP oracle, sketched
as follows. First, compute the minimum vector of weak-constraint violations, $v^*$, i.e.,
the sum of weights of violated constraints at each level, using the oracle, performing
binary search at each level, asking whether a violation limit can be obeyed. Then,
build atom by atom an answer set $X$ whose violation cost matches $v^*$ using the NP
oracle. Overall, this is possible in polynomial time with an NP oracle, hence the
problem is in $FP^{NP}$.



The hardness for $FP^{NP}$ follows from a reduction given by Buccafurri et al. (2000),
which shows how the lexicographic maximum truth assignment to a SAT instance,
whose computation is well-known to be complete for $FP^{NP}$ (Krentel 1988), can be
encoded in terms of the answer set of an ELP with weak constraints. $\Box$
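
The membership argument in the proof above (binary search for the optimal violation vector, followed by an atom-by-atom construction) can be sketched in Python as follows. This is a toy illustration under explicit assumptions: the NP oracle is replaced by brute-force enumeration, the program is abstracted into a hypothetical `is_answer_set` test over truth assignments, weak constraints are given as triples (violation test, weight, level), and higher levels are taken to have higher priority.

```python
from itertools import product

def costs(model, constraints):
    """Sum of weights of violated weak constraints, per priority level."""
    total = {}
    for violated, weight, level in constraints:
        if violated(model):
            total[level] = total.get(level, 0) + weight
    return total

def oracle(atoms, is_answer_set, constraints, partial, fixed, level=None, limit=None):
    """Stand-in for the NP oracle (brute force): is there an answer set that extends
    `partial`, has exactly the costs in `fixed`, and cost <= limit at `level`?"""
    for values in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        if any(model[a] != v for a, v in partial.items()):
            continue
        if not is_answer_set(model):
            continue
        cost = costs(model, constraints)
        if any(cost.get(l, 0) != c for l, c in fixed.items()):
            continue
        if level is not None and cost.get(level, 0) > limit:
            continue
        return True
    return False

def optimal_answer_set(atoms, is_answer_set, constraints):
    if not oracle(atoms, is_answer_set, constraints, {}, {}):
        return None
    best = {}  # minimum violation vector, fixed level by level
    for level in sorted({l for _, _, l in constraints}, reverse=True):
        lo, hi = 0, sum(w for _, w, l in constraints if l == level)
        while lo < hi:  # binary search on the violation limit at this level
            mid = (lo + hi) // 2
            if oracle(atoms, is_answer_set, constraints, {}, best, level, mid):
                hi = mid
            else:
                lo = mid + 1
        best[level] = lo
    model = {}
    for atom in atoms:  # fix one atom per oracle call
        model[atom] = True
        if not oracle(atoms, is_answer_set, constraints, model, best):
            model[atom] = False
    return model, best

# Toy instance: rel_r1 and rel_r2 exclude each other, and rel_r1 requires a.
atoms = ["rel_r1", "rel_r2", "a"]
is_answer_set = lambda m: not (m["rel_r1"] and m["rel_r2"]) and (not m["rel_r1"] or m["a"])
constraints = [
    (lambda m: not m["rel_r1"], 1, 2),
    (lambda m: not m["rel_r2"], 1, 2),
    (lambda m: m["a"], 1, 1),
]
model, best = optimal_answer_set(atoms, is_answer_set, constraints)
```

The number of oracle calls is logarithmic in the total weight per level plus one call per atom, so with a genuine NP oracle in place of the enumeration the procedure runs in polynomial time.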

Note that under data complexity, i.e., where the selection base $\mathcal{S}$ is fixed while
the query $Q$ (given by the facts $R(Q)$) may vary, the problems in Theorem 5 are in
NP and $FP^{NP}$, respectively, since the grounding of $\Pi_{\mathcal{S}}(Q)$ is polynomial in the size of $\mathcal{S}$ and $Q$
in this case. If, moreover, the size of $Q$ is small and bounded by a constant, then
the problems are solvable in polynomial time, since then the number of rules in the
grounding of $\Pi_{\mathcal{S}}(Q)$ is bounded by some constant as well.